diff --git a/docs/site/development/development-process/design-proposals/index-characters.md b/docs/site/development/development-process/design-proposals/index-characters.md new file mode 100644 index 00000000000..50acb476d32 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/index-characters.md @@ -0,0 +1,137 @@ +--- +title: Index Characters +--- + +# Index Characters + +| | | +|---|---| +| Author | Mark Davis | +| Date | 2009-06-23 | +| Status | Accepted | +| Bugs | [2224](http://www.unicode.org/cldr/bugs-private/locale-bugs-private/data?id=2224) | + +This is a proposal for structure to support index characters for UIs, on a per-language basis. + +## Goal + +Index characters are an ordered list of characters for use as a UI "index", that is, a list of clickable characters (or character sequences) that allow the user to see a segment of a larger "target" list. That is, each character corresponds to a bucket in the target list. There can be different kinds of index lists: one that is relatively static, and another that produces roughly equally-sized buckets. While we are mostly focused on the first, there is provision for supporting the second as well. + +The static list would be presented as something like the following (either vertically or horizontally): + +… A B C D E F G H CH I J K L M N O P Q R S T U V W X Y Z … + +Under "A" you would find all items that are greater than or equal to "A" in collation order, and less than any other index character that is greater than "A". The use of the list requires that the target list be sorted according to the locale that is used to create that list. The … items are special: each is a bucket for everything else, either less or greater. Although we say "character" above, the index character could be a sequence, like "CH" above. + +In the UI, an index character could also be omitted or grayed out if its bucket is empty. 
For example, if there is nothing in the bucket for Q, then Q could be omitted. That would be up to the implementation. Additional buckets could be added if other characters are present. For example, we might see something like the following: + +| Sample Greek Index | Contents | +|:---:|---| +| Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω | With only content beginning with Greek letters | +| … Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With some content before or after | +| … 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With numbers, and nothing between 9 and Alpha | +| … 9 A-Z Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With numbers, some Latin | + +## Proposal + +Because these have to be in collation order, the specification of the list can be an unordered set. So the proposal is to add a new kind of exemplar set to CLDR. Unlike other exemplar sets, the case is significant (we may have to change some of the spec language), since it is the preferred case for the language (although it can be changed, according to the implementation). For example, the above could be represented as: + + \ + +  \[A-Z {CH}]\ + + \ + +Note that these are *not* simply uppercase versions of the exemplar characters, such as for Greek: + + \ + +  \[α ά β-ε έ ζ η ή θ ι ί ϊ ΐ κ-ο ό π ρ σ ς τ υ ύ ϋ ΰ φ-ω ώ]\ + + \ + +There is an optional data structure that can be used in some locales. All of these items are optional. + +\ + + \\ \ + + \\ •\ \ + + \{0}-{1}\ + + \\<1劃\ + + \\>24劃\ + + \1劃\ + + \2劃\ + + \3劃\ + +... + +... + +\ + +The indexSeparator can be used to separate the index characters if they occur in free flowing text (instead of, say, on buttons or in cells). The default (root) is a space. Where the index is compressed (by omitting values -- see the priority attribute below), the compressedIndexSeparator can be used instead. + +The indexRangePattern is used for dynamic configuration. 
That is, if there are few items in X, Y, and Z, they can be grouped into a single bucket with \{0}-{1}\, giving "X-Z". The indexLabel can either be applied to a single string from the exemplars, or to the result of an indexRangePattern; so the localizer can turn "X-Z" into "XYZ" if desired. + +The indexLabelBefore and After are used before and after a list. The default (root) value is an ellipsis, as in the example at the top. When displaying index characters with multiple scripts, the main language can be used for all characters from the main script. For other scripts there are two possibilities: + +1. Use the primary characters from the UCA. This has the disadvantage that many very uncommon characters show up. +2. Use the likely-subtags language for each script. For example, if the main language is French, and Cyrillic characters are present, then the likely subtags language for Cyrillic is "ru" (derived by looking up "und-Cyrl"). + +The indexLabel is used to display characters (if it is available). That is, when displaying index characters, if there is an indexLabel, use it instead. For example, for Hungarian, we could have A => "**A, Á**". The priority is used where not all of the index characters can be displayed. In that case, only the higher priorities (lower numbers) would be displayed. + +Note that the indexLabels can be used both with contiguous ranges and non-contiguous ranges. For German we might have [A-S Sch Sci St Su T-Z] as the index characters, and the following labels: + + + +\S\ + +\S\ + +What that means is that the "S" bucket will include anything in [S,Sch), [Sci,St), and [Su,T). That is, items are put into the first display bucket that contains them. That allows for the desired behavior in German (and other languages) of: + +- S + - Satt + - Semel + - Szent +- Sch + - Scherer + - Schoen +- St + - Stumpf + - Sturr + +The indexLabel elements would be added by the TC, not localizers, since they are more complex. 
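The German bucketing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: case-insensitive code point order stands in for a real German collator, and the names `INDEX_LABELS`, `sort_key`, and `display_bucket` are hypothetical, not part of CLDR.

```python
from bisect import bisect_right
from string import ascii_uppercase

# Index characters from the German example: [A-S Sch Sci St Su T-Z].
INDEX_CHARACTERS = list(ascii_uppercase) + ["Sch", "Sci", "St", "Su"]

# Hypothetical label table: the "Sci" and "Su" buckets are displayed as "S".
INDEX_LABELS = {"Sci": "S", "Su": "S"}


def sort_key(text):
    # Assumption: case-insensitive code point order stands in for a real
    # collator; that is sufficient for the ASCII-only sample names below.
    return text.casefold()


BOUNDARIES = sorted(INDEX_CHARACTERS, key=sort_key)
BOUNDARY_KEYS = [sort_key(boundary) for boundary in BOUNDARIES]


def display_bucket(item):
    """Return the display label of the bucket that contains item."""
    position = bisect_right(BOUNDARY_KEYS, sort_key(item)) - 1
    if position < 0:
        return "…"  # sorts before the first index character
    boundary = BOUNDARIES[position]
    # Items past the last index character are not handled specially here.
    return INDEX_LABELS.get(boundary, boundary)


for name in ["Satt", "Semel", "Szent", "Scherer", "Schoen", "Stumpf", "Sturr"]:
    print(name, "->", display_bucket(name))
```

The lookup first finds the underlying bucket (for example "Sci" for "Semel") and only then maps it to its label, which is what lets one display bucket cover several non-contiguous ranges.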
+ +### Sorting variants + +Note that the choice of exemplars may vary with the sorting sequence used. So there is an extra attribute for use in those languages where a non-standard sorting can be used, and the index characters need to be different. + +\...\ + +## Automatic Generation + +For CLDR 1.8, an initial set of index characters has been automatically generated. Translators can tune as necessary. + +The automatic generation takes each of the exemplar characters from CLDR. It then sorts them, putting characters that are the same at a primary level into the same bucket. It then picks one item from each bucket as the representative. Combining sequences are dropped. Korean is handled specially. The large exemplar sets (Japanese, Chinese) need to be done in consultation with translators. + +### Representatives + +Where multiple character sequences sort the same at a primary level, the automatic generation tries to pick the "best" in the following way: + +- Prefer the titlecase versions (A, Dz,...) +- If the NFKD form is shorter, use it. +- If the NFKD form is less (according to the collator but with strength = 3), use it. +- If the binary comparison is less, use it. 
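One possible reading of the generation steps and preference rules above is sketched below in Python. It is not the actual CLDR tool: stripping combining marks and case-folding is only an approximation of "same at a primary level", plain string comparison replaces the collator, and Korean handling and the dropping of combining sequences are not modeled. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def primary_key(text):
    # Assumption: NFKD + removing combining marks + case folding is a rough
    # stand-in for comparing at a collator's primary strength.
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def preference_key(text):
    # Smaller tuples are preferred: titlecase first, then a shorter NFKD
    # form, then a lesser NFKD form, then the lesser binary comparison.
    nfkd = unicodedata.normalize("NFKD", text)
    return (not text.istitle(), len(nfkd), nfkd, text)


def index_characters(exemplars):
    """Group exemplars into primary-level buckets and pick one per bucket."""
    buckets = defaultdict(list)
    for exemplar in exemplars:
        buckets[primary_key(exemplar)].append(exemplar)
    return [min(buckets[key], key=preference_key) for key in sorted(buckets)]


print(index_characters(["a", "á", "A", "b", "B", "dz", "Dz", "DZ"]))
```

Encoding the rules as a sort key keeps the priority order explicit; a real implementation would substitute collator-based comparisons for the string comparisons used here.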
+ +*WARNING: the automatic generation would only be a draft, for translators to tune, so any shortcomings could be fixed.* + + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/islamic-calendar-types.md b/docs/site/development/development-process/design-proposals/islamic-calendar-types.md new file mode 100644 index 00000000000..3cf9dfb383e --- /dev/null +++ b/docs/site/development/development-process/design-proposals/islamic-calendar-types.md @@ -0,0 +1,161 @@ +--- +title: Islamic Calendar Types +--- + +# Islamic Calendar Types + +| Type | Intercalary years with 355 days in each 30-year cycle | +|---|---| +| I | 2, 5, 7, 10, 13, **15** , 18, 21, 24, 26, 29 | +| II | 2, 5, 7, 10, 13, 16, 18, 21, 24, 26, 29 | +| III | 2, 5, 8, 10, 13, 16, 19, 21, 24, **27** , 29 | +| IV | 2, 5, 8, **11** , 13, 16, 19, 21, 24, 26, **30** | + +| subtag | Old Definition | New Definition | Comments | +|---|---|---|---| +| islamic | Astronomical Arabic calendar | Islamic calendar | Redefined as "generic" Islamic calendar. This type does not designate any specific Islamic calendar algorithm variants.

The old term "Arabic" is sometimes used specifically for Islamic calendar variants used in Arabic countries. Islamic (or Hijri) would be a more generic term that is applicable to the calendar variants in Turkey, Malay and other locations. | +| islamicc | Civil (algorithmic) Arabic calendar | *<deprecated>* | Deprecated because it does not have the desired structure - replaced with "islamic-civil" | +| islamic-civil (previously an alias of "islamicc") | < not available > | Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] Friday epoch) | "islamic-civil" was defined as an alias (legacy non-BCP type) of "islamicc" above. This proposal reuses the legacy type, instead of introducing a brand new subtag (such as "islamic-tblc" - tabular/civil) | +| islamic-tbla (new) | *< not available >* | Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] Thursday epoch) | "tbla" means - Tabular / Thursday epoch (also known as astronomical epoch). Using the term "astronomical" would introduce confusion with other variants (all sighting based variants are based on astronomical observation). | +| islamic-umalqura | *< not available >* | Islamic calendar, Umm al-Qura | | +| islamic-rgsa (new) | *< not available >* | Islamic calendar, Saudi Arabia sighting | Requested by Oracle. In future, we may need to add more regional sighting variants. In that case, we propose to use the syntax - "rg" (region) + country/region code. | + +## Tickets + +[#5525](http://unicode.org/cldr/trac/ticket/5525) Disambiguation of Islamic calendar variants + +## Background + +In CLDR, we have two Islamic calendar types available - "islamic" (astronomical?) and "islamic-civil". They originally came from the ICU implementations, which seem to be largely influenced by the book - Calendrical Calculations, Nachum Dershowitz, Edward M. Reingold. 
In this book, the "civil" calendar uses a simple arithmetical algorithm based on the epoch CE 622 July 16, Friday (Julian calendar). ICU's "islamic" (astronomical) is based on astronomical calculation of the moon phase for a certain location. The intent of the latter is to provide a better approximation of the Hijri calendar actually used by countries, but it's not quite perfect. + +Microsoft also provides two calendar types. In .NET, these are named HijriCalendar and UmAlQuraCalendar. The former was introduced a long time ago. According to this page [http://www.staff.science.uu.nl/~gent0113/islam/islam\_tabcal.htm], the algorithm used by MS's "HijriCalendar" is nothing more than a simple arithmetic one, which is pretty similar to ICU's "islamic-civil" calendar. I compared MS's "HijriCalendar" with ICU's "islamic-civil" calendar side by side and I found ICU's islamic-civil is always one day after MS's HijriCalendar. My understanding is that MS's implementation just uses CE 622 July 15 (Julian) as the epoch date, which is one day before the epoch used by ICU's implementation. + +*Note: The link in the above reference page [*http://www.staff.science.uu.nl/~gent0113/islam/islam\_tabcal\_variants.htm*] indicates MS's "Kuwaiti Algorithm" is type I (using Intercalary years with 355 days in each 30-year cycle with 2,5,7,10,13,15,18,21,24,26 & 29), but HijriCalendar class on .NET 4.5 seems to use type II (2,5,7,10,13,16,18,21,24,26 & 29).* + +MS's UmAlQuraCalendar supports the Umm al-Qura calendar introduced by Saudi Arabia [http://www.staff.science.uu.nl/~gent0113/islam/ummalqura.htm]. + +Recently, JSR-310 (formerly - Joda Time) was approved and the whole new date/time package will be integrated into Java 8. JSR-310 folks want to identify calendar type using BCP 47 locale extension. So they proposed to [add a few islamic calendar types](http://unicode.org/cldr/trac/ticket/5525) - one for Umm al-Qura, one for Saudi Arabia sighting. 
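The differences discussed in this section (type I versus type II intercalary years, and the one-day offset between the two epochs) can be illustrated with a small Python sketch of the tabular arithmetic. It is not the ICU or .NET implementation; the function names are hypothetical, and day numbers are counted relative to the Thursday epoch rather than as real Julian day numbers.

```python
# Intercalary (355-day) years in each 30-year cycle, from the type table.
LEAP_YEARS = {
    "I": {2, 5, 7, 10, 13, 15, 18, 21, 24, 26, 29},
    "II": {2, 5, 7, 10, 13, 16, 18, 21, 24, 26, 29},
    "III": {2, 5, 8, 10, 13, 16, 19, 21, 24, 27, 29},
    "IV": {2, 5, 8, 11, 13, 16, 19, 21, 24, 26, 30},
}

# Offset of each epoch from the Thursday epoch (CE 622 July 15, Julian).
# The Friday ("civil") epoch, CE 622 July 16, is one day later.
EPOCH_OFFSET = {"a": 0, "c": 1}

DAYS_PER_CYCLE = 30 * 354 + 11  # 19 common years and 11 intercalary years


def is_leap_year(year, scheme="II"):
    """True if the given Hijri year has 355 days under the given scheme."""
    position = (year - 1) % 30 + 1  # position within the cycle, 1..30
    return position in LEAP_YEARS[scheme]


def days_before_year(year, scheme="II"):
    """Days between the start of year 1 and the start of the given year."""
    cycles, remainder = divmod(year - 1, 30)
    leap_days = sum(1 for y in range(1, remainder + 1) if y in LEAP_YEARS[scheme])
    return cycles * DAYS_PER_CYCLE + remainder * 354 + leap_days


def year_start(year, scheme="II", epoch="c"):
    """Day number of the first day of the year, counted from the Thursday epoch."""
    return days_before_year(year, scheme) + EPOCH_OFFSET[epoch]


# Type IIc always lands one day after type IIa.
print(year_start(1434, "II", "c") - year_start(1434, "II", "a"))
```

Under these definitions, type I and type II differ only for dates that fall after year 15 and before year 17 of a cycle, while the epoch choice shifts every date by exactly one day.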
+ +## Islamic Calendar Variants + +Traditionally, the first day of month is the day of the first sighting of the hilal (crescent moon) shortly after sunset. Because the sighting of the hilal is affected by various unpredictable factors (clouds or brightness of the sky, geographic location and others), it is impossible to determine dates by calculation. This traditional practice is still followed in the majority of Muslim countries. In this proposal, we categorize all sighting-based Islamic calendars as a group of religious calendars. + +Because the religious calendar date is not determined precisely, several algorithms were developed for non-religious purposes. Islamic calendar algorithms defined by simple rules without astronomical calculation are called tabular Islamic calendars. Some of them are used for calculating approximate dates and some are used for administrative purposes in some locations. ICU's civil calendar is one of these and Microsoft's "HijriCalendar" is another one in this group. + +According to RH van Gent, there are several known variations of tabular calendar schemes [[http://www.staff.science.uu.nl/~gent0113/islam/islam\_tabcal.htm](http://www.staff.science.uu.nl/~gent0113/islam/islam_tabcal.htm)]. + +Also, there are two possible epoch dates for each scheme. + +- Epoch type 'a' ("astronomical" or Thursday epoch): CE 622-07-15 (Julian) +- Epoch type 'c' ("civil" or "Friday" epoch): CE 622-07-16 (Julian) + +With his categorization scheme, Microsoft's "HijriCalendar" is Type IIa (Note: He originally categorized this one as type Ia - but he updated it to type IIa recently). ICU's civil calendar is Type IIc. + +There is another algorithmic calendar called Umm Al-Qura. This algorithm is used by Saudi Arabia for administrative purposes. Unlike tabular calendars, the algorithm involves astronomical calculation, but it's still deterministic. Umm Al-Qura is also supported by Islamic communities in North America and Europe. 
This algorithm is implemented by Microsoft as "UmAlQuraCalendar". There is a request to implement the algorithm in ICU ([#8449](http://bugs.icu-project.org/trac/ticket/8449) Add Um Alqura Hijri Calendar Support). + +So Islamic calendar variants are categorized as below: + +- Religious: Based on the sighting of the hilal. Actual dates vary by location. +- Algorithmic + - Tabular: ICU "civil", Microsoft "HijriCalendar" + - Umm Al-Qura: Saudi Arabia and others, Microsoft "UmAlQuraCalendar" + +## Hierarchical Calendar Type Subtags + +CLDR contains a lot of locale data for formatting dates associated with various calendar types. But such calendar algorithm variants are irrelevant to resolving date format symbols and patterns. For example, the same instant may fall on different dates depending on which Islamic algorithm variant is actually used. However, when formatting code creates a text representation of the dates, symbols (month names, day of week names ...) and patterns are most likely shared by all Islamic calendar variants. + +This proposal will introduce a new policy in the calendar type name space. + +- A calendar type might be represented by multiple subtags +- When a calendar type consists of multiple subtags, corresponding formatting data might be resolved by prefix match + +For example, for a given calendar type "ca-xxx-yyy-zzz", if formatting data for type "xxx-yyy-zzz" is not available, then the formatting data for type "xxx-yyy" is used as the fallback, then finally "xxx". + +## Proposed Calendar Type Value Changes + +The table below specifies all of the proposed changes. 
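The prefix-match fallback described under "Hierarchical Calendar Type Subtags" could look roughly like the following Python sketch. The function names and the shape of the `available` mapping are assumptions made for illustration; they do not correspond to an actual CLDR or ICU API.

```python
def fallback_chain(calendar_type):
    """List a calendar type followed by its successively shorter prefixes."""
    subtags = calendar_type.split("-")
    return ["-".join(subtags[:count]) for count in range(len(subtags), 0, -1)]


def resolve_formatting_data(calendar_type, available):
    """Return the formatting data for the longest available prefix, or None.

    `available` is a hypothetical mapping from calendar type to its
    formatting data (symbols and patterns).
    """
    for candidate in fallback_chain(calendar_type):
        if candidate in available:
            return available[candidate]
    return None


print(fallback_chain("xxx-yyy-zzz"))

# Only generic Islamic data exists, so every variant resolves to it.
available = {"islamic": "Islamic symbols and patterns"}
print(resolve_formatting_data("islamic-umalqura", available))
```

With this scheme, adding a new algorithm variant such as "islamic-rgsa" needs no new formatting data unless the variant really formats dates differently.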
+ +Below is part of the actual XML contents (common/bcp47/calendar.xml) + +\ + + \ + +  \ + +   \ + +   \ + +   \ + +   \ + +   \ + +   \ + +  \ + + \ + +\ + +**Note: The following sections, including the old proposal and the discussions, are preserved for reference purposes only** + +### Calendar Type Keywords + +In the CLDR technical committee meeting on 2013-01-02, we thought it was not a good idea to keep adding Islamic calendar variants. CLDR contains a bunch of formatting data associated with calendar type, but these Islamic calendar variants have nothing to do with formatting - they actually share the same formatting data. For this reason, CLDR TC members prefer to support these variants using a separate extension. + +For now, + + **ca-islamic**: Islamic religious calendar (ICU implementation is based on astronomical simulation of moon movement) + + **ca-islamicc**: Islamic civil calendar + +After investigating various Islamic calendar algorithms and practical usage, we're probably going to **deprecate "islamicc"** as calendar type. In addition to "ca" (calendar) type, we would **introduce "cv" (calendar variant)** to distinguish one algorithm from another. So, all Islamic calendar variations share "ca-islamic", but use "cv-xxx" to distinguish a variant from others. For example, + + **ca-islamic-cv-umalqura** : Islamic calendar / Umm al-Qura + +### Proposed cv (Calendar Variant) Values + +Below are the proposed calendar algorithmic variant values in this proposal + +1. **umalqura** - Umm Al Qura calendar of Saudi Arabia +2. **tbla** - Tabular Islamic calendar with leap years in 2nd, 5th, 7th, 10th, 13th, 16th, 18th, 21st, 24th, 26th and 29th year in each 30-year cycle with the Thursday ('a' - astronomical) epoch (Microsoft Hijri Calendar) +3. 
**tblc** - Tabular Islamic calendar with leap years in 2nd, 5th, 7th, 10th, 13th, 16th, 18th, 21st, 24th, 26th and 29th year in each 30-year cycle with the Friday ('c' - civil) epoch ([Calendrica](http://emr.cs.iit.edu/home/reingold/calendar-book/Calendrica.html) Islamic - Arithmetic) + +In addition to the above, other tabular calendars like Fatimid ("tbl27a" or "tbl27c") might be added if necessary. (See Wikipedia article: [Tabular Islamic calendar](http://en.wikipedia.org/wiki/Tabular_Islamic_calendar)) + +For the religious (sighting) calendar, we could explicitly represent the location. For now, JSR-310 community wants one for Saudi Arabia. We could define such variants by defining a syntax like \<2-letter country code> + \, for example "cv-sa0" (sa = Saudi Arabia), but the idea was rejected by the CLDR TC in the CLDR TC meeting on 2013-Jan-30 for the following reasons: 1) the subtag registry should define the exact list whenever possible, 2) such syntax may allow subtags that are not really used - for example, jp0 (jp = Japan) is practically useless, and 3) the region might not be represented by a country code. The CLDR TC's agreement was to register "rgsa" for Saudi Arabia (prefix "rg" (region) + 2-letter country code "sa") explicitly. We may add other regions based on requests in future using the same convention. So in addition to the algorithmic variants above, this proposal includes: + +1. **rgsa** - Islamic calendar variant for Saudi Arabia (based on sighting) + +This proposal will deprecate the use of ca-islamicc (or islamic-civil in long format). With this proposal, keywords are mapped as below. + +| | | | +|---|---|---| +| New | Old | Semantics | +| ca-islamic | ca-islamic | Islamic calendar (variation of calendar algorithm is irrelevant) | +| ca-islamic-cv-tblc | ca-islamicc | Islamic tabular calendar based on the Calendrica arithmetic algorithm. (We should probably avoid the term - Islamic Civil Calendar) | + +### Impacts + +The impacts of the proposed changes are minimal. 
The proposed changes are mostly in bcp47/calendar.xml and the LDML specification. We could remove the format data alias used for islamicc (islamic-civil) currently in root.xml. + +**Counter Proposal: Calendar Type ("ca") with hierarchical subtags** + +Mark (who proposed a new type "cv" originally) raised a concern about the calendar type specified by the combination of "ca" and "cv". His main argument is: Software dealing with calendar algorithms eventually needs to read/store the calendar type in some way. For this purpose, a single combined string is more convenient than two strings stored in different keywords. With his counter proposal, a calendar variant will be specified as the second subtag after the main calendar type. For example, + +ca-**islamic**-cv-**umalqura** -> ca-**islamic-umalqura** + +Technically, both the original proposal (ca-xxx-cv-yyy) and Mark's counter proposal (ca-xxx-yyy) agree that the main calendar type and minor algorithm differences (not affecting formatting) are semantically separated. The question is whether these two items should be physically separated or not. I'm writing down **pros**/**cons** of both approaches. 
+ +Note that we may get requests for some other calendar types/variations such as: + +- Julian calendar +- Gregorian/Julian calendar with a certain switch over date +- Turkish variant of Islamic calendar +- Other regional variants of Islamic calendar + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/iso-636-deprecation-requests-draft.md b/docs/site/development/development-process/design-proposals/iso-636-deprecation-requests-draft.md new file mode 100644 index 00000000000..8a4a1dfd952 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/iso-636-deprecation-requests-draft.md @@ -0,0 +1,19 @@ +--- +title: ISO 636 Deprecation Requests - DRAFT +--- + +# ISO 636 Deprecation Requests - DRAFT + +We have become aware over time of cases where ISO 639 inaccurately assigns different language codes to the same language. Its goal is to distinguish all and only those languages that are not mutually comprehensible. Making too many distinctions can be as harmful as making too few, since it artificially separates two dialects, and disrupts the ability of software to identify them as variants. The remedy used in the past has been to deprecate codes: for example, [mol (mo)](http://www.sil.org/iso639-3/documentation.asp?id=mol) has been merged with [ron (ro)](http://www.sil.org/iso639-3/documentation.asp?id=rol). See also [Picking the Right Language Code](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code) and [Language Distance Data](https://cldr.unicode.org/development/development-process/design-proposals/language-distance-data). + +The current cases in question are listed below. However, we need to collate and organize a background document of information before we go further. 
+ +| Codes | Alternates | Comments | Recommended disposition | +|---|---|---|---| +| [aka (ak)](http://www.sil.org/iso639-3/documentation.asp?id=aka) Akan | [fat](http://www.sil.org/iso639-3/documentation.asp?id=fat) Fanti; [twi](http://www.sil.org/iso639-3/documentation.asp?id=twi) Twi | Sources in Africa confirm what [Wikipedia](http://en.wikipedia.org/wiki/Akan_language) says: that Fanti and Twi are mutually comprehensible, and both are considered Akan. | Deprecate 'fat' and 'twi'; add the names "Fanti" and "Twi" to 'aka' | +| [fas (fa)](http://www.sil.org/iso639-3/documentation.asp?id=fas) Persian | [pes](http://www.sil.org/iso639-3/documentation.asp?id=pes) Western Farsi; [prs](http://www.sil.org/iso639-3/documentation.asp?id=prs) Dari | Again, native speakers confirm that Dari and Farsi are mutually comprehensible, and Dari is simply the name given to Farsi in Afghanistan and other places. That is, in RFC 4646 parlance, Dari and Western Farsi are as close as es-ES and es-AR; fa-AF and prs are essentially synonyms. | Deprecate 'pes' and 'prs'; add the names to 'fas' | +| [tgl (tl)](https://479453595-atari-embeds.googleusercontent.com/embeds/16cb204cf3a9d4d223a0a3fd8b0eec5d/goog_1243893892557) Tagalog | [fil](http://www.sil.org/iso639-3/documentation.asp?id=fil) Filipino | These are widely recognized to be mutually comprehensible. There appear to be only political reasons for separating them. See http://en.wikipedia.org/wiki/Filipino_language , which is corroborated by our native speaker contacts. | Deprecate 'fil'; add the name "Filipino" to 'tgl' | +| [hbs](http://www.sil.org/iso639-3/documentation.asp?id=hbs) (sh) Serbo-Croatian | [bos](http://www.sil.org/iso639-3/documentation.asp?id=bos) (bs) Bosnian; [hrv](http://www.sil.org/iso639-3/documentation.asp?id=hrv) (hr) Croatian; [srp](http://www.sil.org/iso639-3/documentation.asp?id=srp) (sr) Serbian | These are all mutually comprehensible according to many native speakers. 
| Ideally, we would deprecate bos, hrv, srp; add the names to 'hbs'; however, there is probably too much installed base to do this. | + + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/json-packaging-approved-by-the-cldr-tc-on-2015-03-25.md b/docs/site/development/development-process/design-proposals/json-packaging-approved-by-the-cldr-tc-on-2015-03-25.md new file mode 100644 index 00000000000..249716d3c46 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/json-packaging-approved-by-the-cldr-tc-on-2015-03-25.md @@ -0,0 +1,67 @@ +--- +title: JSON Packaging (Approved by the CLDR TC on 2015-03-25) +--- + +# JSON Packaging (Approved by the CLDR TC on 2015-03-25) + +***This page is out of date, see*** [***http://cldr.unicode.org/index/downloads#TOC-JSON-Data***](http://cldr.unicode.org/index/downloads#TOC-JSON-Data) ***for details of the JSON data.*** + +## Overview + +CLDR's official locale data is, and probably always will be, published in XML format. However, JSON format data is quickly becoming a popular alternative, especially for JavaScript applications. Currently, when a CLDR release is published, we generate the corresponding JSON data ( according to http://cldr.unicode.org/index/cldr-spec/json ) and include additional zip files as part of the CLDR distribution. While the distribution of the data in this way is helpful to some, it would be even more useful for us to create a series of packages that are installable using bower ( see http://bower.io/ for details ). In this way, applications that desire to use the CLDR data in JSON format can simply specify the appropriate packages as a prerequisite, thus eliminating the need to copy and redistribute the data. 
+ +### Design Goals + +- Data grouped by functionality - The design should allow people to install packages that contain the data they intend to use, while hopefully keeping the number of packages to a minimum. +- Tiered locale coverage - For the packages created, we can use the CLDR tiers as specified in \ to define an incrementally larger set of locales for each functional package. For each such package, a package containing additional locales can be defined as a proper superset of the corresponding smaller package. Because bower's install mechanism does not allow data from multiple packages to be put in the same directory, each "full" package will contain the complete set of locales that were defined in the corresponding "modern" package, so that applications don't have to have similar data for different locales residing in two different places. The set of locales contained in each tier would be consistent across all the functional packages. +- Default content locales - In order to keep the data size to a minimum, JSON data for default content locales will not be included in the installable packages. Since all default content locales can retrieve their data from the parent via simple inheritance ( i.e. removal of the rightmost portion of the language tag ), it should be relatively easy for most JavaScript applications to determine the appropriate location from which to retrieve the data. A new JSON file "defaultContent.json" shall be included as part of the cldr-core package, in the top level directory, that will contain the list of default content locales for the release. 
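The lookup for default content locales described in the last design goal can be sketched as follows. This Python example is illustrative only: the function names are hypothetical, the default content list is passed in as a plain set rather than read from "defaultContent.json" (whose exact layout is not specified here), and the sample entries are chosen for demonstration.

```python
def parent_locale(locale):
    """Drop the rightmost subtag of a locale identifier, e.g. en-US -> en."""
    head, separator, _ = locale.rpartition("-")
    return head if separator else None


def data_locale(locale, default_content):
    """Return the locale whose JSON files hold the data for `locale`.

    Default content locales ship no data of their own, so the lookup walks
    up to the nearest parent that is not a default content locale.
    """
    while locale in default_content:
        parent = parent_locale(locale)
        if parent is None:
            break
        locale = parent
    return locale


# Illustrative default content list (an assumption, not the real file).
default_content = {"en-US", "sr-Cyrl", "sr-Cyrl-RS"}

print(data_locale("en-US", default_content))
print(data_locale("sr-Cyrl-RS", default_content))
print(data_locale("en-GB", default_content))
```

An application would then load its data files from the directory named after the returned locale.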
In addition, since most applications will make use of the supplemental data, and since the supplemental data is relatively small, CLDR's supplemental data directory shall be included in the "core" package. +- **dates** - Package for basic date/time processing. For each available locale, contains "ca-generic.json", "ca-gregorian.json", "dateFields.json", and "timeZoneNames.json", the required set for Gregorian calendar support. Any packages for non-Gregorian calendars should list the corresponding dates package as a prerequisite. +- **cal-{type}** - Package for non-Gregorian calendar support. For each available locale, contains "ca-{type}.json", where {type} is one of the supported CLDR calendar types: buddhist, chinese, coptic, dangi, ethiopic, hebrew, indian, islamic, japanese, persian, or roc. Note: all required json for calendar variants shall be contained in a single package. For example, the "cal-islamic" package would contain, for each supported locale, "ca-islamic.json", "ca-islamic-civil.json", "ca-islamic-rgsa.json", "ca-islamic-tbla.json", and "ca-islamic-umalqura.json". +- **localenames** - Package for display of locale names, language/territory names, etc. For each available locale, contains "localeDisplayNames.json", "languages.json", "scripts.json", "territories.json", "transformNames.json", and "variants.json". +- **misc** - Package for miscellaneous JSON that doesn't fall into any other functional category. For each included locale, contains characters.json, contextTransforms.json, delimiters.json, layout.json, listPatterns.json, and "posix.json". +- **numbers** - Package containing data necessary for proper formatting of numbers and currencies. For each available locale, contains currencies.json and numbers.json. The "dates" package should list the "numbers" package as a prerequisite, since numbers information is often required in order to format dates properly. +- **units** - Package containing data pertaining to units. 
For each included locale, contains units.json and measurementSystemNames.json. +- **segments** - Contains the segments data (from the Unicode ULI project). Contains "segments/{locale}/suppressions.json", where {locale} is one of the locale identifiers that has segmentation information. + +### Locale Coverage Tiers + +For each functional group listed above (with the exception of the "core" package), data packages shall be created that define successively larger numbers of locales. The tiers shall be based on the coverage information used in the CLDR survey tool, as defined in http://unicode.org/repos/cldr/trunk/tools/java/org/unicode/cldr/util/data/Locales.txt . For any given language, all the valid CLDR sublocales for that language shall be in the package. The following tiers shall be created. + +- **modern** - (Depends on the "core" package only) - Contains those locales specified for "modern" coverage in Locales.txt in the "Cldr" organization. Includes those locales listed in the "tier1", "tier2", "tier 3", "tier 4", "generated", "ext", and "other" sections. +- **full** - (Depends on the corresponding "modern" package) - Contains all remaining locales published in CLDR's main directory, that aren't contained in one of the tiers above. 
+ +### Summary Table of Packages and Dependencies + +| Package Name | Depends On | +|---|---| +| cldr-core | <nothing> | +| cldr-dates-modern | cldr-numbers-modern | +| cldr-dates-full | cldr-dates-modern, cldr-numbers-full | +| cldr-cal-buddhist-modern | cldr-dates-modern | +| cldr-cal-buddhist-full | cldr-dates-full | +| ...<similar pattern for other cldr-cal-* packages> | | +| cldr-localenames-modern | cldr-core | +| cldr-localenames-full | cldr-localenames-modern | +| cldr-misc-modern | cldr-core | +| cldr-misc-full | cldr-misc-modern | +| cldr-numbers-modern | cldr-core | +| cldr-numbers-full | cldr-numbers-modern | +| cldr-units-modern | cldr-core | +| cldr-units-full | cldr-units-modern | +| cldr-segments-modern | cldr-core | + +Locales by Tier as of CLDR 26 (for reference purposes only) + +| **Tier** | **Locales** | +|---|---| +| modern | af, am, ar, az, bg, bn, bs, ca, cs, da, de, en, el, es, et, eu, fa, fil, fi, fr, gl, gu, he, hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, ky, lo, lt, lv, mk, ml, mn, ms, mr, my, nb, ne, nl, pa, pl, pt, ro, ru, si, sr, sk, sl, sq, sv, sw, ta, te, th, tr, uk, ur, uz, vi, zh, zu | +| full | All other locales. | + + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/language-data-consistency.md b/docs/site/development/development-process/design-proposals/language-data-consistency.md new file mode 100644 index 00000000000..3f064e70548 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/language-data-consistency.md @@ -0,0 +1,31 @@ +--- +title: Language Data Consistency +--- + +# Language Data Consistency + +We have a set of tests for consistency in the data for language, script, and country. The following is a draft description of what those consistency checks should aim for. + +## Default script, language + +1. 
For each script encoded in Unicode, the default\* language is present in the script metadata. +2. For each language used in CLDR, there is a default\* script + +\* default = most used in writing; currently if modern, otherwise historical. + +## Implications for Language-Country population data (LCPD) + +1. If a base-language has a CLDR locale, then it is in the LCPD for at least one country. +2. If there is a CLDR country locale for a language, then that language+country is in the LCPD. + 1. For each country, get the language most widely used as a written language in that country. That language+country combination is in the LCPD. + 2. When a significant proportion of the language use in a country is in a non-default script, that script is marked in the LCPD. + 3. When a script is not EXCLUDED in UAX#31, then we have at least one language-country pair in the LCPD. +3. If a language has a significant\* literate population in a country, the pair is in the LCPD. This target is fuzzier, but definitely + 1. anything \>1M, or + 2. \>100K and either official (real, not honorary) or 1/3 of the population. + +## Implications for Likely Subtags + +Likely Subtags are built from the language-country population data, plus the script metadata, plus an exception list. + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file