CLDR-16438 C document filtering out unneeded data (unicode-org#2766)
* CLDR-16438 C document filtering out unneeded data

* CLDR-16438 Add main section

* CLDR-16438 Restore previous changes
macchiati authored and srl295 committed Mar 10, 2023
1 parent 00817da commit 72de16e
Showing 2 changed files with 66 additions and 7 deletions.
13 changes: 6 additions & 7 deletions docs/ldml/tr35-info.md
@@ -516,21 +516,20 @@ That is, this data provides recommended fallbacks for use when a charset or supp
## 8 <a name="Coverage_Levels" href="#Coverage_Levels">Coverage Levels</a>

The following describes the structure used to set coverage levels used for CLDR.
That structure is primarily intended for internal use in CLDR tooling — it is not anticipated that users of CLDR data would need it.
That structure is used in CLDR tooling, and can also be used by consumers of CLDR data, such as described in [Data Size Reduction](tr35.md#Data_Size).

Each level adds to what is in the lower level. This list will change between releases of CLDR, and more detailed information for each level is on [Coverage Levels](https://cldr.unicode.org/index/cldr-spec/coverage-levels).
The following lists the coverage levels. The qualifications for each level may change between releases of CLDR, and more detailed information for each level is on [Coverage Levels](https://cldr.unicode.org/index/cldr-spec/coverage-levels). Each level adds to what is in the lower level, so Basic includes all of Core, Moderate all of Basic, and so on.


| Level | Description | |
| ----: | ------------- | --- |
| Code | Level | Description |
| ----: | ------------- | -------------- |
| 0 | undetermined | Does not meet any of the following levels. |
| 10 | core | Core Locale — Has minimal data about the language and writing system that is required before other information can be added using the CLDR survey tool. |
| 40 | basic | Selectable Locale — Minimal locale data necessary for a "selectable" locale in a platform UI. Very basic number and datetime formatting, etc. |
| 60 | moderate | Document Content Locale — Minimal locale data for applications such as spreadsheets and word processors to support general document content internationalization: formatting number, datetime, currencies, sorting, plural handling, and so on. |
| 80 | modern | UI Locale — Contains all fields in normal modern use, including all CLDR locale names, country names, timezone names, currencies in use, and so on. |
| 100 | comprehensive | Above modern level; typically far more data than is needed in practice. |
| 100 | comprehensive | Above modern level; typically more data than is needed in most implementations. |

Levels 40 through 80 are based on the definitions and specifications listed below.
The Basic through Modern levels are based on the definitions and specifications listed below.

```xml
<!ELEMENT coverageLevels ( approvalRequirements, coverageVariable*, coverageLevel* ) >
```
60 changes: 60 additions & 0 deletions docs/ldml/tr35.md
@@ -157,6 +157,9 @@ The LDML specification is divided into the following parts:
* 7.1.1 [Motivation](#Motivation)
* 7.1.2 [Loose Matching](#Loose_Matching)
* 7.2 [Handling Invalid Patterns](#Invalid_Patterns)
* 8 [Data Size Reduction](#Data_Size)
* 8.1 [Vertical Slicing](#Vertical_Slicing)
* 8.2 [Horizontal Slicing](#Horizontal_Slicing)
* [Annex A Deprecated Structure](#Deprecated_Structure)
* [A.1 Element fallback](#Fallback_Elements)
* [A.2 BCP 47 Keyword Mapping](#BCP47_Keyword_Mapping)
@@ -3171,6 +3174,62 @@ Processes sometimes encounter invalid number or date patterns, such as a number
* For a pattern that contains a currently-invalid pattern character (applies only to date patterns, for which A-Za-z are reserved as pattern characters but not all defined as valid):
* Produce an error (set an error code or throw an exception) when an attempt is made to create a formatter with such a pattern or to apply such a pattern to an existing formatter.

## 8 <a name="Data_Size" href="#Data_Size">Data Size Reduction</a>
Software implementations may have tight memory constraints.
The following outlines some techniques for filtering out CLDR data that a particular implementation does not need.
The exact filtering depends on the requirements of the implementation in question.

Locale data can be _sliced_ to exclude data not needed by a particular implementation.
This can be _vertical slicing_ (excluding a locale and all the locales inheriting from it) or _horizontal slicing_ (excluding particular types of data from all locales).
For example:
* A vertical slice could retain only those locales used in a particular set of markets, such as EU locales.
* A horizontal slice could remove all data in the emoji/ directory, which are annotations for emoji and symbols.

Both techniques can be applied together.

### 8.1 <a name="Vertical_Slicing" href="#Vertical_Slicing">Vertical Slicing</a>

Which locales to filter out depends heavily on the particular implementation.
One useful source for making that determination is the [Supplemental Territory Information](https://unicode.org/reports/tr35/tr35-info.html#Supplemental_Territory_Information),
which provides information on the use of languages in different countries/regions.
(For a human-readable chart, see [Territory-Language Information](https://unicode-org.github.io/cldr-staging/charts/latest/supplemental/territory_language_information.html).)

It is important to note that if a particular locale is in a vertical slice, then all of its parents should be as well, because of inheritance.
This is not a factor if the data is fully resolved, as in the JSON format data.
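
The parent requirement can be illustrated with a minimal sketch. It assumes that a parent is found either in an explicit parent table (CLDR supplies such data as `parentLocales`) or by truncating the last subtag; the class `VerticalSlice`, its methods, and the sample data are hypothetical and are not part of the CLDR tooling.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class VerticalSlice {
    /** Returns the parent of a locale ID, or null if the locale is root. */
    static String parentOf(String locale, Map<String, String> explicitParents) {
        if (locale.equals("root")) {
            return null;
        }
        // Assumption: explicit parents take precedence over truncation.
        String explicit = explicitParents.get(locale);
        if (explicit != null) {
            return explicit;
        }
        int lastUnderscore = locale.lastIndexOf('_');
        return lastUnderscore < 0 ? "root" : locale.substring(0, lastUnderscore);
    }

    /** Returns the requested locales plus every ancestor of each one. */
    static Set<String> withParents(Set<String> requested, Map<String, String> explicitParents) {
        Set<String> slice = new LinkedHashSet<>();
        for (String locale : requested) {
            String current = locale;
            // Stop at root, or as soon as a locale is already in the slice:
            // its ancestors were added when it was first seen.
            while (current != null && slice.add(current)) {
                current = parentOf(current, explicitParents);
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        // Illustrative data only: one explicit parent entry.
        Map<String, String> explicitParents = Map.of("es_MX", "es_419");
        System.out.println(withParents(Set.of("de_CH", "es_MX"), explicitParents));
    }
}
```

With this input the slice contains `de_CH`, `de`, `es_MX`, `es_419`, `es`, and `root`. A real implementation would read the parent table from the supplemental data rather than hard-coding it.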

Slicing can also remove related supplemental data.
For example, the likely subtags data includes a large number of languages that may not be of interest for all implementations.
Where an implementation only includes (say) the CLDR locales at Basic coverage in [Unicode CLDR - Coverage Levels](https://cldr.unicode.org/index/cldr-spec/coverage-levels)
(and locales inheriting from them), the likely subtag data that doesn’t match can be filtered out.
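
As a rough sketch of one possible filtering policy (an assumption, not a rule defined by this specification), the following keeps a likely subtags entry only when its target language is retained and its source is either `und`-based or a retained language. The class `LikelySubtagsFilter` and the sample entries are hypothetical; the entries use the `from`/`to` form of the likely subtags data.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LikelySubtagsFilter {
    /** Returns the language subtag, that is, the part before the first underscore. */
    static String languageOf(String localeId) {
        int firstUnderscore = localeId.indexOf('_');
        return firstUnderscore < 0 ? localeId : localeId.substring(0, firstUnderscore);
    }

    /** Keeps only the entries that resolve to one of the retained languages. */
    static Map<String, String> retain(Map<String, String> likelySubtags, Set<String> keptLanguages) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : likelySubtags.entrySet()) {
            String fromLanguage = languageOf(entry.getKey());
            String toLanguage = languageOf(entry.getValue());
            boolean sourceOk = fromLanguage.equals("und") || keptLanguages.contains(fromLanguage);
            if (sourceOk && keptLanguages.contains(toLanguage)) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative entries only.
        Map<String, String> likelySubtags = Map.of(
                "de", "de_Latn_DE",
                "und_CH", "de_Latn_CH",
                "aa", "aa_Latn_ET",
                "und_ET", "am_Ethi_ET");
        System.out.println(retain(likelySubtags, Set.of("de")));
    }
}
```

Dropping `und`-based entries changes how unknown-language tags for those regions resolve, so an implementation may prefer to keep them even when the target language is not retained.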

### 8.2 <a name="Horizontal_Slicing" href="#Horizontal_Slicing">Horizontal Slicing</a>

The main reason to perform horizontal slicing is when a particular feature is not used, so the implementation wants to remove the data required for powering that feature.
For example, if an application isn't performing date formatting, it can remove all date formatting data (transitively).
It must take care to retain data used by other features: in the previous example, the number formatting data must stay if currencies are still being formatted.
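
One way to reason about transitive removal is to compute everything the retained features depend on and to drop the rest. The sketch below assumes a hand-maintained dependency table; the feature and data group names (`dateFormatting`, `numbers`, and so on) and the class `HorizontalSlice` are hypothetical and do not come from CLDR.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class HorizontalSlice {
    /** Returns the used features plus everything they depend on, directly or transitively. */
    static Set<String> required(Set<String> usedFeatures, Map<String, Set<String>> dependsOn) {
        Set<String> required = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>(usedFeatures);
        while (!pending.isEmpty()) {
            String item = pending.pop();
            // Visit each item once; this also guards against cycles in the table.
            if (required.add(item)) {
                pending.addAll(dependsOn.getOrDefault(item, Set.of()));
            }
        }
        return required;
    }

    public static void main(String[] args) {
        // Hypothetical dependency table for illustration.
        Map<String, Set<String>> dependsOn = Map.of(
                "dateFormatting", Set.of("dates", "numbers"),
                "currencyFormatting", Set.of("currencies", "numbers"));
        System.out.println(required(Set.of("currencyFormatting"), dependsOn));
    }
}
```

Here the application does not format dates, so `dates` is not in the required set and can be removed, while `numbers` is retained because currency formatting still needs it.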

Locales may also have data on a field-by-field basis that is reasonable to filter out.
For example, locales that meet the Modern level of coverage typically also include some data at a Comprehensive level.
That data is not needed by most implementations, and can usually be filtered out.
For example, in CLDR version 43, 58% of the script names (`//ldml/localeDisplayNames/scripts/script[@type="*"]`) are at the Comprehensive level;
in fact, ~20% of all values for the Modern level locales are at the Comprehensive level.

The easiest way to filter by coverage level is to use the CLDR Java tooling (the `cldr-code` package) to filter the data before generating the implementation's data format.
That approach gives the implementation direct access to the CoverageLevel code, which can determine the coverage level for a given locale and path.
Once the data is transformed, such as to the JSON format, the CoverageLevel code is no longer accessible.
For example, here is a code snippet:

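```java
import org.unicode.cldr.util.CLDRConfig;
import org.unicode.cldr.util.Level;
import org.unicode.cldr.util.SupplementalDataInfo;
...
private static final SupplementalDataInfo SUPPLEMENTAL_DATA_INFO = CLDRConfig.getInstance().getSupplementalDataInfo();
...
// Keep the path only if its coverage level is at or below the chosen cutoff.
Level pathLevel = SUPPLEMENTAL_DATA_INFO.getCoverageLevel(path, locale);
if (minimumPathCoverage.compareTo(pathLevel) >= 0) {
    include(path);
}
```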

Similarly, the subdivision translations represent a large body of data that may not be needed for many implementations.

* * *

## <a name="Deprecated_Structure" href="#Deprecated_Structure">Annex A Deprecated Structure</a>
@@ -3703,6 +3762,7 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
**Revision 68**

* Proposed Update for CLDR Version 43.
* Added new section on [Data Size Reduction](#Data_Size).

Note that small changes such as typos and link fixes are not listed above. Modifications in previous versions are listed in those respective versions. Click on **Previous Version** in the header until you get to the desired version.
