ICU-22763 MF2: Handle time zones correctly in :datetime #3012

Open
wants to merge 20 commits into base: main

catamorphism
Contributor

@catamorphism catamorphism commented May 16, 2024

Previously, the time zone components of date/time literals were ignored. In order to store the time zone properly for re-formatting, this change also changes the representation of Formattable date values to use a GregorianCalendar rather than a UDate.

This is a public API change and a design doc can be found in the "ICU 76 API proposal status" document (and was also emailed to the icu-design list on April 29.)

In the TC discussion on May 9, there was some uncertainty about whether to use GregorianCalendar or Calendar as the return type of message2::Formattable::getDate(). I'm leaving it as GregorianCalendar for now but can change it if necessary.

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22763
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_function_registry.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Previously, the time zone components of date/time literals were ignored.
In order to store the time zone properly for re-formatting,
this change also changes the representation of `Formattable` date
values to use a `GregorianCalendar` rather than a `UDate`.
(This is a public API change.)
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/unicode/messageformat2_formattable.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism marked this pull request as ready for review May 16, 2024 22:03
@catamorphism
Contributor Author

cc @echeran @mihnita

@markusicu
Member

@FrankYFTang setting you as the main reviewer because Mihai is ooo for a while soon. Note that the API change proposal is still under discussion.

@FrankYFTang
Contributor

Here is my view:

  1. Currently, the Message Format proposal only accepts strings in ISO 8601, without support for the time zone extension and calendar extension described in RFC 9557 https://datatracker.ietf.org/doc/html/rfc9557

  2. Therefore, the only additional information it needs to, and should be allowed to, get back is the UTC offset when the string has one, and ISO 8601 only allows that at minute precision:
    "
    5.3.4.1 Time shift between local time scale and UTC
    The time shift between the local time scale and UTC can be expressed in hours and minutes, or hours only
    "

  3. Therefore, the requirement is to let the caller get back this information only, without opening a can of worms beyond that. We should also let the interface distinguish whether the string has a Z (UTC) or no Z.

  4. Currently, in ICU, we can get the zone offset from Calendar.get(UCAL_ZONE_OFFSET) and the DST offset from Calendar.get(UCAL_DST_OFFSET); the UTC offset is their sum, in milliseconds.
    So I think we should define a new type for input and return, which has

    a. a UDate

    b. int32_t rawOffsetGMT - represents the zone offset + DST offset in milliseconds
    and this offset (b above) can be used to create a TimeZone by

icu::SimpleTimeZone::SimpleTimeZone ( int32_t rawOffsetGMT,
const UnicodeString & ID
)

@FrankYFTang
Contributor

Maybe something like

struct DateAndGMTOffset {
  UDate date;
  int32_t rawOffsetGMT;
};

?
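
A minimal sketch of how such a date-plus-offset pair could be filled in from the offset suffix of a literal. It uses plain C++ rather than ICU types: `UDateMillis` stands in for `UDate`, and `parseOffsetMillis` is a hypothetical helper written for this illustration, not an ICU API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for ICU's UDate (a double holding milliseconds since the Unix epoch).
using UDateMillis = double;

struct DateAndGMTOffset {
    UDateMillis date;
    int32_t rawOffsetGMT;  // zone offset + DST offset, in milliseconds
};

static bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Hypothetical helper: converts an ISO 8601 "+hh:mm" or "-hh:mm" suffix to
// milliseconds. Returns false (leaving `out` untouched) for any other input.
bool parseOffsetMillis(const std::string& s, int32_t& out) {
    if (s.size() != 6 || (s[0] != '+' && s[0] != '-') || s[3] != ':') {
        return false;
    }
    if (!isAsciiDigit(s[1]) || !isAsciiDigit(s[2]) ||
        !isAsciiDigit(s[4]) || !isAsciiDigit(s[5])) {
        return false;
    }
    const int32_t hours = (s[1] - '0') * 10 + (s[2] - '0');
    const int32_t minutes = (s[4] - '0') * 10 + (s[5] - '0');
    if (hours > 23 || minutes > 59) {
        return false;
    }
    const int32_t millis = (hours * 60 + minutes) * 60 * 1000;
    out = (s[0] == '-') ? -millis : millis;
    return true;
}
```

With ICU available, the resulting `rawOffsetGMT` is the value that would be passed to the `SimpleTimeZone` constructor quoted above.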

@aphillips
Member

Note that MF2 will not stay with only supporting vanilla 8601 strings. Real time zone support is important. An offset is insufficient.

@FrankYFTang
Contributor

FrankYFTang commented Jun 13, 2024

> Note that MF2 will not stay with only supporting vanilla 8601 strings. Real time zone support is important. An offset is insufficient.

That is what I asked this morning, and the answer I got back was only vanilla 8601 strings. If MF2 is going to support RFC 9557 https://datatracker.ietf.org/doc/html/rfc9557 then we will need to consider not just the time zone extension in RFC 9557 but also the calendar extension. Therefore, it would be a super bad idea to use GregorianCalendar, since the date could be in a different calendar system.

We should consider the following cases:

  1. the string has a Z
  2. the string has a UTC offset
    (both 1 and 2 are just vanilla 8601 strings)
  3. the string has a time zone extension as in RFC 9557
  4. the string has a calendar extension as in RFC 9557
  5. any combination of 1, 2, 3, 4

For 3, we need a string to indicate the real time zone name, but not more than that.
For 4, we need a string to indicate the calendar name, but not more than that.

how about

struct DateInfo {
  UDate date;
  int32_t rawOffsetGMT;
  const char* zoneName; // "Z" if UTC, nullptr if only a UTC offset was given, otherwise the name from RFC 9557
  const char* calendarName; // nullptr if not specified
};
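
To make the cases above concrete, here is a rough sketch that splits a literal into its ISO 8601 part, a Z marker or UTC offset, and the bracketed RFC 9557 suffixes. `LiteralParts` and `splitLiteral` are hypothetical names for this illustration only, and the sketch deliberately ignores parts of RFC 9557 such as the `!` critical flag and other suffix keys.

```cpp
#include <cassert>
#include <string>

// Illustrative result type; the field names are assumptions for this sketch.
struct LiteralParts {
    std::string dateTime;      // everything before any Z, offset, or bracket
    bool isUtc = false;        // case 1: trailing 'Z'
    std::string offset;        // case 2: "+hh:mm" or "-hh:mm", empty if absent
    std::string zoneName;      // case 3: e.g. "America/Los_Angeles", empty if absent
    std::string calendarName;  // case 4: e.g. "hebrew", empty if absent
};

LiteralParts splitLiteral(const std::string& literal) {
    LiteralParts parts;
    std::string::size_type bracket = literal.find('[');
    std::string head = literal.substr(0, bracket);

    // Walk the bracketed suffixes: [zone] or [u-ca=calendar].
    while (bracket != std::string::npos) {
        const std::string::size_type close = literal.find(']', bracket);
        if (close == std::string::npos) {
            break;  // malformed suffix; stop scanning
        }
        const std::string tag = literal.substr(bracket + 1, close - bracket - 1);
        if (tag.compare(0, 5, "u-ca=") == 0) {
            parts.calendarName = tag.substr(5);
        } else if (tag.find('=') == std::string::npos) {
            parts.zoneName = tag;
        }
        bracket = literal.find('[', close);
    }

    // Peel a trailing 'Z' or "+hh:mm"/"-hh:mm" off the ISO 8601 part.
    const std::string::size_type n = head.size();
    if (n > 0 && head[n - 1] == 'Z') {
        parts.isUtc = true;
        head.erase(n - 1);
    } else if (n > 6 && (head[n - 6] == '+' || head[n - 6] == '-') &&
               head[n - 3] == ':') {
        parts.offset = head.substr(n - 6);
        head.erase(n - 6);
    }
    parts.dateTime = head;
    return parts;
}
```

Case 5 (any combination) falls out of the structure: each field is filled independently, so a literal may carry an offset, a zone name, and a calendar name at the same time.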

@catamorphism
Contributor Author

> that is what I asked this morning and the answer I got back is only vanilla 8601 strings.

I apologize for giving the wrong answer -- I was answering based on the existing spec. @aphillips is more of an authority than me about how the spec will evolve in the future.

@aphillips
Member

(personal response, chair hat off)

The offset is irrelevant. The only thing an offset is useful for is adjusting the date value (the incremental time value--Temporal calls this an Instant). Once the incremental time value is computed, you can safely forget the offset. (In most situations, I would argue that the value should be normalized to UTC and the implementation should forget the offset) The time zone, however, is used in expression of the date value later. The time zone can also be changed or removed ("floating" the value).

The use of non-Gregorian calendars in date/time serializations is more problematic: there's not a lot of implementation experience, particularly with non-binary multi-era calendars. It's usually better (or at least more common) to use a proleptic Gregorian calendar (i.e. 8601) on the wire (especially for incremental time values!) and convert using calendar rules for processing/display. Perhaps this will evolve and mature further? My experience here is limited.

There also needs to be a concept of a floating time value (one not tied to specific instants on the timeline). This could use null for the time zone. Perhaps:

struct DateInfo {
  UDate date; // what is a `UDate` anyway? Is it millis since the Unix epoch?
  const char* zoneName; // IANA tz name; "UTC" if UTC; nullptr if value is floating
  const char* calendarName; // nullptr if not specified; proleptic Gregorian (8601) calendar is default
};

In any case, this was a significant gap in the MF2 tech preview. It's important for us to add time zone management to MF2 before LDML46 and the ICU4C implementation should be designed/adapted with that in mind.
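
As an illustration of the point about offsets, here is a small sketch in plain C++ (no ICU) in which the offset is used once to compute the UTC instant and is then dropped, while the zone name is kept for later display. `InstantWithZone` and `makeInstant` are hypothetical names, an empty zone name stands in for the "floating" case, and the day-count helper is a commonly used proleptic Gregorian algorithm rather than anything from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Days between 1970-01-01 and the given proleptic Gregorian date
// (months are 1-12, days are 1-31).
int64_t daysFromCivil(int64_t y, unsigned m, unsigned d) {
    y -= (m <= 2) ? 1 : 0;
    const int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);             // [0, 399]
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;            // [0, 146096]
    return era * 146097 + static_cast<int64_t>(doe) - 719468;
}

// Hypothetical value type for this sketch (not the PR's DateInfo).
struct InstantWithZone {
    int64_t utcMillis;     // milliseconds since the Unix epoch, in UTC
    std::string zoneName;  // IANA name, "UTC", or empty for a floating value
};

// The offset adjusts the wall-clock fields to UTC and is then forgotten;
// only the zone name survives for use when the value is formatted.
InstantWithZone makeInstant(int64_t y, unsigned m, unsigned d,
                            unsigned hh, unsigned mm, unsigned ss,
                            int32_t offsetMillis, const std::string& zoneName) {
    const int64_t localMillis =
        (((daysFromCivil(y, m, d) * 24 + hh) * 60 + mm) * 60 + ss) * 1000;
    return InstantWithZone{localMillis - offsetMillis, zoneName};
}
```

For example, 10:00:00 at +05:30 and 04:30:00 at UTC on the same date produce the same `utcMillis`; they differ only in the zone name that a formatter would later use.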

@catamorphism
Contributor Author

Thanks for the feedback, @aphillips and @FrankYFTang . I'll take a look at these options and update the design doc accordingly.

My first thought, though, is why not something like:

struct DateInfo {
  UDate date;
  SimpleTimeZone tz; // Captures offset and time zone name
  const char* calendarName; // nullptr if not specified
};

Why not re-use the existing SimpleTimeZone class?

@catamorphism
Contributor Author

@FrankYFTang I've made changes to match the suggestions; can you take another look? I've also updated the design proposal in the "ICU 76 API proposal status" doc.

* @internal ICU 76 technology preview
* @deprecated This API is for technology preview only.
*/
UnicodeString calendarName;
Contributor

I do not understand why you need createTimeZone() here, since all the information you need can be derived from zoneName itself. I would prefer that we not add more functions unless they are really needed.

Contributor Author

I included it because, to call DateFormat::format() to eventually format the date, a TimeZone object has to be constructed (to pass to DateFormat's adoptTimeZone() or setTimeZone() method). But it doesn't have to be a method of the DateInfo struct (changed in 1c68d51).

* @internal ICU 76 technology preview
* @deprecated This API is for technology preview only.
*/
UnicodeString zoneName;
Contributor

should this be const?

Contributor Author

DateInfo is movable (because it can appear inline in Formattable, which is movable), so the members can't be const. I could change DateInfo to a class and make the members private, if you think that would be better.

* @internal ICU 76 technology preview
* @deprecated This API is for technology preview only.
*/
UnicodeString calendarName;
Contributor

should this be const?

Contributor Author

See above

@srl295 srl295 self-requested a review June 28, 2024 20:39
Member

@srl295 srl295 left a comment

Generally looking GTM, some comments

icu4c/source/test/testdata/message2/more-functions.json (outdated, resolved)
* @internal ICU 76 technology preview
* @deprecated This API is for technology preview only.
*/
UnicodeString calendarName;
Member

What is the calendarName supposed to map to? Is it the 'old style name' (gregorian) matching

virtual const char * getType() const = 0;
or a bcp47 -u-ca type (gregory)?

If it's from the locale, wouldn't that be visible from the locale of the message format itself?

Member

The spec currently says that it's this, but it's not fully defined in v45 (the calendar option is reserved for future standardization). Probably we want to support whatever JS Temporal does or at least be compatible with that.

Member

It should probably link there, as a 'calendar type' in ICU4C is one of the old types generally.

Contributor Author

@srl295 Added a comment in 82f9ddc - let me know if that looks OK.

@srl295
Member

srl295 commented Jun 28, 2024

needs squash

@catamorphism
Contributor Author

@FrankYFTang Are you OK with the current state of the design doc for this PR? The JIRA ticket shows as "Accepted", but I've made changes to the design doc since your last comment.

return 0;
}
// 'zzzz' to handle the case where the timezone offset follows the ISO 8601 datetime
if (sourceStr.length() > 25) {
Contributor

Could you explain where the 25 came from? How did you conclude that the length needs to be > 25, and that 25 is enough?

https://unicode.org/reports/tr35/tr35-dates.html#Contents shows

zzzz | Pacific Daylight Time | The long specific non-location format. Where that is unavailable, falls back to the long localized GMT format ("OOOO").

and

OOOO | GMT-08:00 | The long localized GMT format.

Contributor Author

I suppose it doesn't really need to check the length at all, just whether the last character is 'Z'. Changed that in c63c6b5.

@catamorphism
Contributor Author

In 197caef, I added some comments explaining the meaning of the fields in the DateInfo struct. These can be checked against the code in messageformat2_formattable.cpp that formats an existing DateInfo struct, and that parses a date literal into a DateInfo -- the two locations where the DateFormat::adoptTimeZone() method is called.

@catamorphism
Contributor Author

I also removed the calendarName field in 9800253. On further thought, it doesn't seem necessary; a DateInfo represents a parsed date literal, and date literal strings don't include a calendar name. In the future, calendar will be an option on the :datetime formatter. Then it will be handled the same way as all other options, and the return value from the formatter will represent the previously passed-in calendar by default. Therefore, I don't think we need it in the DateInfo struct, which is meant to represent the parsed version of a date literal string.

@catamorphism
Contributor Author

This needs to be rebased because of #3050 moving where the test data is, but I'll do that at the very end in order to avoid losing review comments.

if (U_FAILURE(errorCode)) {
return 0;
}
return dateParser->parse(sourceStr, errorCode);
Contributor

  1. dateParser->parse() is a const method, which means it could be shared across threads, right?
  2. There are only 4 calls to tryPattern() in this method:

tryPattern("YYYY-MM-dd'T'HH:mm:ss"
tryPattern("YYYY-MM-dd"
tryPattern("YYYY-MM-dd'T'HH:mm:ssZ"
tryPattern("YYYY-MM-dd'T'HH:mm:sszzzz"

So there are only 4 possible values of pat, right?
Currently, the DateFormat is created inside tryPattern(). Every time tryPattern() is called, the code has to create a SimpleDateFormat and later destroy it, but only 4 distinct SimpleDateFormat objects are ever needed, and their usage is thread-safe (see point 1 above).
So should we have a lazy initialization routine that creates these four SimpleDateFormat objects, caches them globally, and reuses them later? The creation needs to be protected so that it runs only once, with a mutex to prevent different threads from creating them concurrently, but then we can reuse them across threads and destroy them at library cleanup.

That way, we do not need to repeatedly create and destroy the SimpleDateFormat objects.

Contributor Author

Yes, that seems reasonable. I'll try it.

Contributor Author

OK, done in 8d4da42. I'm not sure of how to unit-test this (i.e. testing that the DateFormats really are initialized only once), but I'll look for examples in the test suite.
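
For readers following the thread, here is a generic sketch of the init-once idea together with one possible way to observe it in a test. It is not the code from 8d4da42: `PatternFormatter` is a stand-in for `SimpleDateFormat`, the names are hypothetical, and a C++11 function-local static is used in place of ICU's own init-once and cleanup machinery. The build counter exists only so that a test can check that construction happened once.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an immutable formatter that is safe to share across threads.
struct PatternFormatter {
    std::string pattern;
};

namespace {

// Counts how many times the shared formatters were built (test hook only).
std::atomic<int> gBuildCount{0};

std::vector<PatternFormatter> buildFormatters() {
    ++gBuildCount;
    // The four patterns quoted in the review comment above.
    return std::vector<PatternFormatter>{
        PatternFormatter{"YYYY-MM-dd'T'HH:mm:ss"},
        PatternFormatter{"YYYY-MM-dd"},
        PatternFormatter{"YYYY-MM-dd'T'HH:mm:ssZ"},
        PatternFormatter{"YYYY-MM-dd'T'HH:mm:sszzzz"},
    };
}

}  // namespace

// Returns the shared formatters, building them on first use. C++11 guarantees
// that the initializer of a function-local static runs exactly once, even if
// several threads make the first call concurrently.
const std::vector<PatternFormatter>& sharedFormatters() {
    static const std::vector<PatternFormatter> formatters = buildFormatters();
    return formatters;
}

int buildCount() { return gBuildCount.load(); }
```

A test along these lines could call `sharedFormatters()` several times (possibly from several threads) and then assert that `buildCount()` is 1 and that every call returned the same object. Whether an equivalent hook is acceptable in ICU's test suite is a separate question.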

if (len <= 6) {
return false;
}
return ((sourceStr[len - 6] == PLUS || sourceStr[len - 6] == HYPHEN)
Contributor

Please add a comment at the beginning of this function explaining the expected input and output.

Contributor Author

Done in 4a7da29
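
As an illustration of the kind of comment requested, here is a self-contained sketch of an offset check with its input and output documented. The function name `hasOffsetSuffix` is hypothetical, and because the condition in the diff excerpt above is cut off, the digit and colon checks here are assumptions rather than the PR's actual logic.

```cpp
#include <cassert>
#include <string>

static bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

/**
 * Checks whether a date/time literal ends with a UTC offset.
 *
 * Input:  a date/time literal, e.g. "2024-05-16T10:00:00-08:00".
 * Output: true if the last six characters have the form "+hh:mm" or "-hh:mm"
 *         (a sign, two digits, a colon, two digits) and are preceded by at
 *         least one other character; false otherwise.
 */
bool hasOffsetSuffix(const std::string& sourceStr) {
    const std::string::size_type len = sourceStr.size();
    if (len <= 6) {
        return false;
    }
    const char sign = sourceStr[len - 6];
    return (sign == '+' || sign == '-')
        && isAsciiDigit(sourceStr[len - 5])
        && isAsciiDigit(sourceStr[len - 4])
        && sourceStr[len - 3] == ':'
        && isAsciiDigit(sourceStr[len - 2])
        && isAsciiDigit(sourceStr[len - 1]);
}
```

In this sketch the colon check matters: a date-only literal such as "2024-05-16" also has a hyphen six characters from the end, so testing the sign position alone would misclassify it.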

@catamorphism
Contributor Author

By the way, I'll be on vacation from now until September 9, so I won't see further comments until I return.

@@ -57,6 +57,15 @@ namespace message2 {

DateTimeType type;
DateTimeFactory(DateTimeType t) : type(t) {}

Contributor

So you store 4 pointers per DateTimeFactory object. But in messageformat2_function_registry.cpp you create multiple DateTimeFactory objects, and each one will create the same four DateFormat objects, with no differences between them. Could you use InitOnce and cleanup code to store those four DateFormat objects so that they are shared globally?

Member

@catamorphism search for icu::UInitOnce and ucln_common_registerCleanup for examples.

5 participants