ICU-21957 integrate CLDR release-42-m1 (early milestone) to ICU main for ICU 72 #2103

Contributor

@pedberg-icu pedberg-icu commented May 26, 2022

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21957
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

This integrates CLDR release-42-m1 (milestone release as of the start of general submission) to ICU. Several significant changes:

  • CLDR version is updated from 41 to 42, notably in the icu4c/source/data/coll/* files and in the /icu4c/source/data/*/LOCALE_DEPS.json files.
  • CLDR 42 is changing the way spaces are used in patterns to be more “Unicode-like”, moving away from just ASCII-range space to use things like U+2009 thin space (on either side of en-dash in intervalFormats for Latin script), and U+202F narrow no-break-space in narrow unit patterns, to separate short/narrow AM/PM markers for Latin/Cyrillic/Greek, and to separate the year marker abbreviation in Cyrillic. These changes are not yet complete in this milestone 1 integration. However, they affect many unit tests. They also required some updates to the SimpleDateFormat parsing code (mostly in ICU4J) to allow for the non-breaking spaces in patterns. Many C tests are switched over to using u"xxx" strings.
  • For the date-time combining patterns: The forms that specify date "at" time (and thus require particular grammatical inflections in many languages) are moved to a new type="atTime" variant in CLDR, and a new DateTimePatterns%atTime bundle in ICU. What remains for the standard date-time combining patterns are the versions that do not generally require any special grammatical inflection (e.g. just using comma or space instead of literal text). This requires changes to the code that loads these patterns for SimpleDateFormat, DateTimePatternGenerator, and RelativeDateTimeFormatter in order to keep using the atTime forms for these. For DateIntervalFormat we are now (with this PR) using the standard forms, to accommodate patterns like "May 25, 3:00-5:00 PM" for which the "atTime" combining patterns would be inappropriate. All of this is awaiting spec guidance from CLDR on which patterns to use in which situations.
  • CLDR has also added some alternate currency patterns. For locales other than English, these currently have draft="provisional" status and so are not converted to ICU (after vetters verify them the draft status should get updated); in the PR the alternates only appear in English:
    • currencyFormat%alphaNextToNumber: For currency formats that have no space between ¤ and the number, this alternate pattern provides a different pattern (usually with space) to use when the currency symbol has a letter on the side adjacent to the number. May be used for both standard and compact decimal formats; also for accountingFormat%alphaNextToNumber.
    • currencyFormat%noCurrency: Provides a pattern for currency-style formatting without the currency symbol. This format may be used, for example, in a table of values all for the same currency (in which repeating the symbol for each value is unnecessary). Typically this is not needed for compact currency formats, since the compact decimals can be used instead, but it may be provided if necessary.
  • Other more specific changes:
    • There is a new measurement unit, duration-quarter.
    • Two new currency codes: SLE, VED
    • icu4c/source/data/coll/root.txt emoji collation is updated for Unicode 15 emoji
    • Maltese adds plural category two (and Catalan adds category many for special use mainly in compact decimals, like French, Italian, etc.)
    • The English name of QAR changes to use riyal instead of rial
    • English adds/updates display names for several languages
  • This does not yet have ICU versions of the new CLDR personName formats data (need to have a discussion about the desired ICU format for that).
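
As a rough illustration of the alphaNextToNumber idea described above, the following C++ sketch picks between a standard and an alternate currency pattern depending on whether the symbol's character next to the number is a letter. The two pattern strings, the helper name choosePattern, and the ASCII-only letter test are assumptions made for illustration; they are not taken from ICU's implementation or from the converted data in this PR.

```cpp
#include <cassert>
#include <string>

// Illustrative patterns only (assumed values); the real ones come from locale data.
static const std::u16string kStandardPattern = u"\u00A4#,##0.00";
static const std::u16string kAlphaNextToNumberPattern = u"\u00A4\u00A0#,##0.00";

// Simplification: only ASCII letters count as letters in this sketch.
static bool isAsciiLetter(char16_t c) {
    return (c >= u'A' && c <= u'Z') || (c >= u'a' && c <= u'z');
}

// Hypothetical helper. In both patterns above the symbol precedes the number,
// so the character adjacent to the number is the symbol's last character.
const std::u16string& choosePattern(const std::u16string& symbol) {
    if (!symbol.empty() && isAsciiLetter(symbol.back())) {
        return kAlphaNextToNumberPattern;
    }
    return kStandardPattern;
}

// Small usage check: "$" and "US$" end in a non-letter, "USD" ends in a letter.
void checkChoosePattern() {
    assert(choosePattern(u"$") == kStandardPattern);
    assert(choosePattern(u"US$") == kStandardPattern);
    assert(choosePattern(u"USD") == kAlphaNextToNumberPattern);
}
```

A real implementation would need a proper Unicode letter test and would also have to handle patterns where the symbol follows the number.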

Note: I added a logKnownIssue skip for part of FormattedStringBuilderTest::testInsertOverflow() to address a crash that only happens when running CI exhaustive tests for Clang/Linux (not on my local system) and appears to have nothing to do with this PR; see https://unicode-org.atlassian.net/browse/ICU-22047. When the problematic code is skipped, the test uses an alternative way of generating a long enough UnicodeString, so the remaining tests in that function are still valid.

@pedberg-icu pedberg-icu marked this pull request as draft May 26, 2022 01:01
@pedberg-icu pedberg-icu force-pushed the ICU-21957-brs72rc-integrate-cldr-42-m1 branch from c58675c to adfd3bc Compare May 26, 2022 02:16
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/data/coll/zh.txt is no longer changed in the branch
  • icu4c/source/data/locales/lu.txt is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu pedberg-icu marked this pull request as ready for review May 26, 2022 04:18
macchiati previously approved these changes May 26, 2022
@pedberg-icu pedberg-icu force-pushed the ICU-21957-brs72rc-integrate-cldr-42-m1 branch from 52aef36 to 514d9ad Compare May 26, 2022 06:21
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu pedberg-icu force-pushed the ICU-21957-brs72rc-integrate-cldr-42-m1 branch from 3398dd9 to fa702a4 Compare May 27, 2022 05:05
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@srl295
Member

srl295 commented May 27, 2022

Review tip:

$ gh pr checkout 2103
$ git diff fa702a450462925d520dc86a372c77827e82aa38..e17219582ed0396d993927066e95af0c9199f8db -IcldrVersion -IVersion

(ignore version-number changes)

Comment on lines 162 to 165
 static const UChar gDefaultPattern[] =
 {
-    0x79, 0x79, 0x79, 0x79, 0x4D, 0x4D, 0x64, 0x64, 0x20, 0x68, 0x68, 0x3A, 0x6D, 0x6D, 0x20, 0x61, 0
-}; /* "yyyyMMdd hh:mm a" */
+    0x79, 0x4D, 0x4D, 0x64, 0x64, 0x20, 0x68, 0x68, 0x3A, 0x6D, 0x6D, 0x202F, 0x61, 0
+}; /* "yMMdd hh:mm\u202Fa" */
Member

note that the gDefaultPattern changed.

Contributor Author

Yes. We had changed yyyy -> y in CLDR some years ago, so updating per that. I also added the \u202F here for consistency with the new formats. Not completely sure about that.

Contributor Author

@srl295 I decided to revert the use of \u202F in gDefaultPattern, so now the only thing that has changed is yyyy -> y.
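
For readers comparing the two forms, here is a minimal C++ sketch of the pattern as described after the revert (only yyyy -> y changed), written both as a code-unit array and as a u"..." literal, plus the U+202F variant that was backed out. It assumes UChar is equivalent to char16_t; the names kPatternAsArray, kPatternAsLiteral, kPatternWithNnbsp, and firstDifference are hypothetical and do not appear in ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: UChar is a 16-bit code unit type equivalent to char16_t.
static const char16_t kPatternAsArray[] = {
    0x79, 0x4D, 0x4D, 0x64, 0x64, 0x20, 0x68, 0x68, 0x3A, 0x6D, 0x6D, 0x20, 0x61, 0
}; /* "yMMdd hh:mm a" */

// The same pattern as a UTF-16 string literal.
static const char16_t kPatternAsLiteral[] = u"yMMdd hh:mm a";

// The variant with U+202F NARROW NO-BREAK SPACE before the AM/PM field.
static const char16_t kPatternWithNnbsp[] = u"yMMdd hh:mm\u202Fa";

// Returns the index of the first differing code unit, or -1 if the strings are equal.
long firstDifference(const std::u16string& a, const std::u16string& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size()) {
        if (a[i] != b[i]) {
            return static_cast<long>(i);
        }
        ++i;
    }
    return (a.size() == b.size()) ? -1 : static_cast<long>(i);
}

// Usage check: the array and literal agree; the NNBSP variant differs only in the space before 'a'.
void checkPatterns() {
    assert(firstDifference(kPatternAsArray, kPatternAsLiteral) == -1);
    assert(firstDifference(kPatternAsLiteral, kPatternWithNnbsp) == 11);
}
```

The literal form is easier to review than the hex array, which is presumably part of why many C tests in this PR were switched to u"xxx" strings.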

srl295 previously approved these changes May 27, 2022
@pedberg-icu pedberg-icu force-pushed the ICU-21957-brs72rc-integrate-cldr-42-m1 branch from 27cdd35 to 9cefa0f Compare May 27, 2022 17:10
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…for 72 (rebased on main) +

FormattedStringBuilderTest::testInsertOverflow infolns,logKnownIssue skip for CI exhaustive crash
@pedberg-icu pedberg-icu force-pushed the ICU-21957-brs72rc-integrate-cldr-42-m1 branch from 8b9ddc8 to edfbe17 Compare May 27, 2022 19:24
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Contributor Author

/azp run CI-Exhaustive

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Contributor Author

Review tip:

$ gh pr checkout 2103
$ git diff fa702a450462925d520dc86a372c77827e82aa38..e17219582ed0396d993927066e95af0c9199f8db -IcldrVersion -IVersion

(ignore version-number changes)

Probably need to update that to
$ git diff edfbe17..e172195 -IcldrVersion -IVersion

@pedberg-icu pedberg-icu requested review from srl295 and macchiati May 27, 2022 19:33
@pedberg-icu
Contributor Author

@srl295 and @macchiati, for an exhaustive test failure (nothing to do with this PR) I had to add some logging and a logKnownIssue skip in FormattedStringBuilderTest::testInsertOverflow(). Also I reverted part of the change to gDefaultPattern in ICU4C SimpleDateFormat, so the only change there is now just yyyy -> y. Other than those two things, this PR is the same as what you previously approved, but needs re-approval. Thanks!

Member

@srl295 srl295 left a comment

diff lgtm

@pedberg-icu
Contributor Author

Hmm, C: Linux Clang Exhaustive Tests (Ubuntu 18.04) still crashing in FormattedStringBuilderTest::testInsertOverflow(), will submit a separate PR with a fix for that.

@pedberg-icu pedberg-icu merged commit 64b3548 into unicode-org:main May 27, 2022
@pedberg-icu pedberg-icu deleted the ICU-21957-brs72rc-integrate-cldr-42-m1 branch May 27, 2022 20:50
@pedberg-icu pedberg-icu assigned srl295 and unassigned markusicu May 27, 2022
@pedberg-icu
Contributor Author

Hmm, C: Linux Clang Exhaustive Tests (Ubuntu 18.04) still crashing in FormattedStringBuilderTest::testInsertOverflow(), will submit a separate PR with a fix for that.

The PR for that is #2104

But the crash itself may be related to the following change in FormattedStringBuilder per https://unicode-org.atlassian.net/browse/ICU-22005 (Integer overflow leading to OOB/CHECK in icu_71::FormattedStringBuilder::prepareForInsertHelper): PR #2070

bandali pushed a commit to bandali/roundcubemail that referenced this pull request Jan 31, 2023
* tests/Rcmail/Rcmail.php (test_format_date): Starting with ICU 72.1,
a NARROW NO-BREAK SPACE (NNBSP) is used instead of an ASCII space
before the meridian.  So, check for an NNBSP when using ICU >= 72.1.

References:
* https://icu.unicode.org/download/72
* unicode-org/icu#2103
bandali pushed a commit to bandali/roundcubemail that referenced this pull request Jan 31, 2023
* tests/Rcmail/Rcmail.php (test_format_date): Starting with ICU 72.1,
a NARROW NO-BREAK SPACE (NNBSP) is used instead of an ASCII space
before the meridian.  So, check for an NNBSP when using ICU >= 72.1.

References:
* https://icu.unicode.org/download/72
* https://cldr.unicode.org/index/downloads/cldr-42
* unicode-org/icu#2103
alecpl pushed a commit to roundcube/roundcubemail that referenced this pull request Jan 31, 2023
* tests/Rcmail/Rcmail.php (test_format_date): Starting with ICU 72.1,
a NARROW NO-BREAK SPACE (NNBSP) is used instead of an ASCII space
before the meridian.  So, check for an NNBSP when using ICU >= 72.1.

References:
* https://icu.unicode.org/download/72
* https://cldr.unicode.org/index/downloads/cldr-42
* unicode-org/icu#2103
alecpl pushed a commit to roundcube/roundcubemail that referenced this pull request Feb 4, 2023
* tests/Rcmail/Rcmail.php (test_format_date): Starting with ICU 72.1,
a NARROW NO-BREAK SPACE (NNBSP) is used instead of an ASCII space
before the meridian.  So, check for an NNBSP when using ICU >= 72.1.

References:
* https://icu.unicode.org/download/72
* https://cldr.unicode.org/index/downloads/cldr-42
* unicode-org/icu#2103
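
The referencing commits above adapt a test to expect NNBSP on ICU >= 72.1. Another option a downstream test might consider is to normalize the new space characters before comparing against hard-coded ASCII expectations. The C++ sketch below shows that idea using only the standard library; the helper name normalizePatternSpaces and the sample strings are hypothetical, and it only handles the two characters named in this PR's description (U+2009 and U+202F).

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: maps U+2009 THIN SPACE and U+202F NARROW NO-BREAK SPACE
// to an ASCII space so output can be compared across ICU versions.
std::u16string normalizePatternSpaces(std::u16string s) {
    std::replace_if(
        s.begin(), s.end(),
        [](char16_t c) { return c == u'\u2009' || c == u'\u202F'; },
        u' ');
    return s;
}

// Usage check with made-up sample strings (not actual ICU output).
void checkNormalizePatternSpaces() {
    assert(normalizePatternSpaces(u"3:00\u202FPM") == u"3:00 PM");
    assert(normalizePatternSpaces(u"3:00 PM") == u"3:00 PM");
}
```

Normalizing is only appropriate for test comparisons or lenient matching; displayed text should keep the characters the formatter produced.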