From 1a4703957d214596815a28ee54603629c890508b Mon Sep 17 00:00:00 2001 From: Peter Edberg Date: Mon, 27 Mar 2023 21:33:14 -0700 Subject: [PATCH] CLDR-9669 Improve some spec info about effect of locale keywords --- docs/ldml/tr35-dates.md | 4 ++- docs/ldml/tr35-general.md | 11 ++++++- docs/ldml/tr35-numbers.md | 4 ++- docs/ldml/tr35.md | 68 +++++++++++++++++++++++++++++++-------- 4 files changed, 70 insertions(+), 17 deletions(-) diff --git a/docs/ldml/tr35-dates.md b/docs/ldml/tr35-dates.md index 62c5095a496..9abf11d405f 100644 --- a/docs/ldml/tr35-dates.md +++ b/docs/ldml/tr35-dates.md @@ -1145,7 +1145,7 @@ These values provide territory-specific information needed for week-of-year and In order for a week to count as the first week of a new year for week-of-year calculations, it must include at least the number of days in the new year specified by the minDays value; otherwise the week will count as the last week of the previous year (and for week-of-month calculations, `minDays` also specifies the minimum number of days in the new month for a week to count as part of that month). -The day indicated by `firstDay` is the one that should be shown as the first day of the week in a calendar view. This is not necessarily the same as the first day after the weekend (or the first work day of the week), which should be determined from the weekend information. Currently, day-of-week numbering is based on `firstDay` (that is, day 1 is the day specified by `firstDay`), but in the future we may add a way to specify this separately. +The day indicated by `firstDay` is the one that should be shown as the first day of the week in a calendar view. This is not necessarily the same as the first day after the weekend (or the first work day of the week), which should be determined from the weekend information. Currently, day-of-week numbering is based on `firstDay` (that is, day 1 is the day specified by `firstDay`), but in the future we may add a way to specify this separately. 
The `firstDay` value determined from the region can be overridden by the locale keyword "fw"; see [Unicode First Day Identifier](tr35.md#UnicodeFirstDayIdentifier). What is meant by the weekend varies from country to country. It is typically when most non-retail businesses are closed. The time should not be specified unless it is a well-recognized part of the day. The `weekendStart` day defaults to "sat", and `weekendEnd` day defaults to "sun". For more information, see _[Dates and Date Ranges](tr35.md#Date_Ranges)_. @@ -1199,6 +1199,8 @@ The B and b date symbols provide for formats like “3:00 at night”. When the Some systems may not want to use B and b, even if preferred for the locale, so for compatibility the `preferred` value is limited to {H, h, K, k}, and is the option selected by the ‘j’ date symbol. Thus the `preferred` value may not be the same as the first `allowed` value. +The `preferred` value for the locale can be overridden by the locale keyword "hc"; see [Unicode Hour Cycle Identifier](tr35.md#UnicodeHourCycleIdentifier). + ### Day Period Rule Sets ```xml diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md index 213e1ebbd91..8d0455a975d 100644 --- a/docs/ldml/tr35-general.md +++ b/docs/ldml/tr35-general.md @@ -1481,7 +1481,14 @@ The references section supplies a central location for specifying references and The `segmentations` element provides for segmentation of text into words, lines, or other segments. The structure is based on [[UAX29](https://www.unicode.org/reports/tr41/#UAX29)] notation, but adapted to be machine-readable. It uses a list of variables (representing character classes) and a list of rules. Each must have an `id` attribute. -The rules in _root_ implement the segmentations found in [[UAX29](https://www.unicode.org/reports/tr41/#UAX29)] and [[UAX14](https://www.unicode.org/reports/tr41/#UAX14)], for grapheme clusters, words, sentences, and lines. They can be overridden by rules in child locales. 
+The rules in _root_ implement the segmentations found in [[UAX29](https://www.unicode.org/reports/tr41/#UAX29)] and +[[UAX14](https://www.unicode.org/reports/tr41/#UAX14)], for grapheme clusters, words, sentences, and lines. They can be +overridden by rules in child locales. In addition, there are several locale keywords that affect segmentation: + +* "dx", [Unicode Dictionary Break Exclusion Identifier](tr35.md#UnicodeDictionaryBreakExclusionIdentifier) +* "lb", [Unicode Line Break Style Identifier](tr35.md#UnicodeLineBreakStyleIdentifier) +* "lw", [Unicode Line Break Word Identifier](tr35.md#UnicodeLineBreakWordIdentifier) +* "ss", [Unicode Sentence Break Suppressions Identifier](tr35.md#UnicodeSentenceBreakSuppressionsIdentifier) Here is an example: @@ -2313,6 +2320,8 @@ The following `type` attributes are in use: In many languages there may not be a difference among many of these lists. In others, the spacing, the length or presence of a conjunction, and the separators may change. +Currently there are no locale keywords that affect list patterns; the patterns are selected using the base locale ID, ignoring any -u- extension keywords. + ### Gender of Lists ```xml diff --git a/docs/ldml/tr35-numbers.md b/docs/ldml/tr35-numbers.md index 71b70bb0287..322c75ab9bd 100644 --- a/docs/ldml/tr35-numbers.md +++ b/docs/ldml/tr35-numbers.md @@ -430,7 +430,7 @@ The following additional elements were intended to allow proper placement of the ``` -In addition to a standard currency format, in which negative currency amounts might typically be displayed as something like “-$3.27”, locales may provide an "accounting" form, in which for "en_US" the same example would appear as “($3.27)”. +In addition to a standard currency format, in which negative currency amounts might typically be displayed as something like “-$3.27”, locales may provide an "accounting" form, in which for "en_US" the same example would appear as “($3.27)”. 
The locale keyword "cf" can be used to select the standard or accounting form, see [Unicode Currency Format Identifier](tr35.md#UnicodeCurrencyFormatIdentifier). ```xml @@ -1092,6 +1092,8 @@ Plural categories may also differ according to the visible decimals. For example There are also variants of the above: for example, short fractions may have the Digits behavior, but longer fractions may just look at the final digit of the fraction. +Currently there are no locale keywords that affect plural rule selection; they are selected using the base locale ID, ignoring any -u- extension keywords. + #### Explicit 0 and 1 rules Some types of CLDR data (such as [unitPatterns](tr35-general.md#Unit_Elements) and [currency displayNames](#Currencies)) allow specification of plural rules for explicit cases “0” and “1”, in addition to the language-specific plural cases specified above: “zero”, “one”, “two” ... “other”. For the language-specific plural rules: diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 5bf3b438d7b..14c8daf680d 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -715,7 +715,12 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -735,7 +740,11 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -766,7 +775,12 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -782,7 +796,11 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -793,7 +811,11 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -805,7 +827,11 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -815,7 +841,11 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -827,10 +857,13 @@ The BCP 47 form for keys and types is the canonical form, and recommended. 
Other - + @@ -840,8 +873,11 @@ The determination of preferred units depends on the locale identifer: the keys m - @@ -869,7 +905,11 @@ The determination of preferred units depends on the locale identifer: the keys m - +
key
(old key name)
key description
example type
(old type name)
type description
A Unicode Calendar Identifier defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca" in bcp47/calendar.xml.
A Unicode Calendar Identifier + defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca" + in bcp47/calendar.xml.
+ This selects calendar-specific data within a locale used for formatting and parsing, such as date/time symbols and patterns; it also selects supplemental + calendarData used for calendrical calculations. +
"ca"
(calendar)
Calendar algorithm

(For information on the calendar algorithms associated with the data used with these, see [Calendars].)
"buddhist"
Note: Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura".
A Unicode Currency Format Identifier defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in bcp47/currency.xml.
A Unicode Currency Format Identifier + defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in + bcp47/currency.xml.
+ This selects the specific type of currency formatting pattern within a locale. +
"cf" Currency Format style "standard" Negative numbers use the minusSign symbol (the default).
ISO 4217 code,

plus others in common use

Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in Supplemental Currency Data, plus the default number of decimals.

The XXX code is given a broader interpretation as Unknown or Invalid Currency.

A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break + (for words and lines). The valid values consist of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of + key name="dx" in bcp47/segmentation.xml.
+ This affects break iteration regardless of locale. +
"dx" Dictionary break script exclusions unicode_script_subtag values Use a text presentation for emoji characters if possible.
"default"Use the default presentation for emoji characters as specified in UTR #51 Presentation Style.
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data (see Part 4 Dates, Week Data). The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml.
A Unicode First Day Identifier + defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental + week data for the region (see Part 4 Dates, Week Data). The valid values are those name attribute values in the type elements + of key name="fw" in bcp47/calendar.xml. +
"fw" First day of week "sun"
"sat" Saturday
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data (see Part 4 Dates, Time Data). The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml.
A Unicode Hour Cycle Identifier + defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data for the region + (see Part 4 Dates, Time Data). The valid values are those name attribute values in the type elements of + key name="hc" in bcp47/calendar.xml. +
"hc" Hour cycle "h12"
"h24" Hour system using 1–24; corresponds to 'k' in pattern
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml.
A Unicode Line Break Style Identifier + defines a preferred line break style corresponding to the CSS level 3 line-break option. + Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name + attribute values in the type elements of key name="lb" in bcp47/segmentation.xml. +
"lb" Line break style "strict"
"loose" CSS lev 3 line-break=loose
A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml.
A Unicode Line Break Word Identifier + defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. + Specifying "lw" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "keepall"). The valid values are those name + attribute values in the type elements of key name="lw" in bcp47/segmentation.xml. +
"lw" Line break word handling "normal"
"phrase" Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline
A Unicode Measurement System Identifier defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data (see Part 2 General, Measurement System Data). The valid values are those name attribute values in the type elements of key name="ms" in bcp47/measure.xml. -The determination of preferred units depends on the locale identifer: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. -For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. -
A Unicode Measurement System Identifier + defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data for the region + (see Part 2 General, Measurement System Data). The valid values are those name attribute values in the + type elements of key name="ms" in bcp47/measure.xml. + The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. + For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. +
"ms" Measurement system "metric"
"uksystem" UK System of measurement: feet, pints, etc.; pints are 20oz
A Measurement Unit Preference Override defines an override for measurement unit preference. The valid values are those name attribute values in the type elements of key name="mu" in bcp47/measure.xml. -For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. +
A Measurement Unit Preference Override + defines an override for measurement unit preference. The valid values are those name attribute values in the type elements of key name="mu" in + bcp47/measure.xml. + For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. +
"mu" Measurement unit override "celsius"
"tamldec" Modern Tamil decimal digits
A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the <rgScope> element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag).
A Region Override specifies an alternate region to use for obtaining + certain region-specific default values (those specified by the <rgScope> element), instead of using the region + specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the + unicode_language_subtag). +
"rg" Region Override "uszzzz"

The value is a unicode_subdivision_id of type “unknown” or “regular”; this consists of a unicode_region_subtag for a regular region (not a macroregion), suffixed either by “zzzz” (case is not significant) to designate the region as a whole, or by a unicode_subdivision_suffix to provide more specificity. For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences.
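
As an illustration of how the keywords above might be read from a locale identifier, the following is a minimal Python sketch, not part of the specification. The function names `parse_u_keywords` and `region_for_defaults` are hypothetical. The sketch assumes a well-formed BCP 47 identifier, ignores -u- attributes, does not infer a region from the language subtag, and applies the "rg" region to every lookup rather than only to the values scoped by the rgScope element.

```python
from typing import Dict, Optional


def parse_u_keywords(locale_id: str) -> Dict[str, str]:
    """Return the -u- extension keywords of a locale ID as {key: type}."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    if "u" not in subtags:
        return {}
    keywords: Dict[str, str] = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            # Another singleton (for example "t" or "x") ends the -u- extension.
            break
        if len(subtag) == 2:
            # Two-character subtags are keys.
            key = subtag
            keywords[key] = ""
        elif key is not None:
            # Longer subtags extend the type of the current key ("islamic-civil").
            keywords[key] = subtag if not keywords[key] else keywords[key] + "-" + subtag
        # Subtags before the first key are attributes; this sketch skips them.
    return keywords


def region_for_defaults(locale_id: str) -> Optional[str]:
    """Pick the region used for region-specific defaults, honoring "rg" if present."""
    rg = parse_u_keywords(locale_id).get("rg", "")
    if len(rg) >= 2 and rg[:2].isalpha():
        # Assumption: a two-letter region followed by "zzzz" or a subdivision suffix.
        return rg[:2].upper()
    subtags = locale_id.replace("_", "-").lower().split("-")
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break
        is_alpha_region = len(subtag) == 2 and subtag.isalpha()
        is_numeric_region = len(subtag) == 3 and subtag.isdigit()
        if is_alpha_region or is_numeric_region:
            return subtag.upper()
    return None  # A real implementation would infer the region from the language.
```

With this sketch, “en-GB-u-rg-uszzzz” yields the keyword map `{"rg": "uszzzz"}` and the region US for defaults, while plain “en-GB” yields GB. Keys such as "fw", "hc", and "ms" would be looked up in the same map and, when present, take precedence over the default for the region.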