
Add support for the Gregorian Calendar availableFormats #480

Merged
merged 23 commits on Mar 29, 2021

Conversation

gregtatum
Member

@gregtatum gregtatum commented Feb 8, 2021

This PR adds support for the availableFormats key to the DateTimeFormats, in order to provide support for skeleton matching. It is part of the dependencies for Issue #481. I'm splitting out this piece from the rest of the work, as it is fairly self contained, and it's what the rest of the work will be based upon. It has a fairly large code diff, as it's generating lots of test data, so it should be helpful to get it in place first.

This is ready for review.

(If anyone is curious, my draft PR for all of the components bag work is available in #479, which includes things like Skeleton.)

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • components/datetime/src/options/components.rs is no longer changed in the branch
  • components/datetime/src/options/preferences.rs is no longer changed in the branch
  • components/datetime/src/options/style.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/mod.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/structs.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/tests/components.json is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@coveralls

coveralls commented Feb 8, 2021

Pull Request Test Coverage Report for Build 2e341c4b05f71c613135ee01a632d7bf3a49039e-PR-480

  • 290 of 423 (68.56%) changed or added relevant lines in 9 files are covered.
  • 30 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.07%) to 73.142%

Changes missing coverage (file / covered lines / changed or added lines / %):
components/datetime/src/fields/length.rs 2 3 66.67%
components/datetime/src/provider/helpers.rs 3 4 75.0%
components/datetime/src/fields/mod.rs 2 5 40.0%
components/datetime/src/pattern/error.rs 4 7 57.14%
components/datetime/src/provider/mod.rs 15 28 53.57%
components/provider_cldr/src/transform/dates.rs 29 42 69.05%
components/datetime/src/fields/symbols.rs 56 78 71.79%
components/datetime/src/pattern/mod.rs 36 67 53.73%
components/datetime/src/skeleton.rs 143 189 75.66%
Files with coverage reduction (file / new missed lines / %):
components/datetime/src/fields/length.rs 1 86.67%
components/datetime/src/fields/mod.rs 1 40.91%
components/datetime/src/provider/mod.rs 1 36.0%
components/datetime/src/fields/symbols.rs 7 82.32%
utils/litemap/src/map.rs 20 71.6%
Totals:
Change from base Build 30a3909542955e156a11a979f57d8c38c1dbeac5: -0.07%
Covered Lines: 6269
Relevant Lines: 8571

💛 - Coveralls

"medium": "{1}, {0}",
"short": "{1}, {0}"
},
"available_formats": [
Member Author

I was trying to decide on the best representation for this data, as it's a key-value pair. Making it a simple tuple seemed like the best choice here to me.

Ultimately, the key will be converted from Cow<'static, str> to a struct Skeleton that can then be used for skeleton matching. I'm not 100% sure if it's useful to provide these as Cows, or what the ultimate design of the memory management for providers is. However, this is following the existing examples on how this data is being read in, so I'm assuming it's compatible.

My understanding of the data flow is

  • The serialized JSON is read in.
  • This gets deserialized into the DateTimeFormatsV1 structure.
  • Skeletons are generated from the DateTimeFormatsV1
  • Skeleton matching happens with the Skeleton struct.
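
To make that flow concrete, here is a minimal sketch of the tuple representation with a hypothetical `Skeleton` parse (the real `Skeleton` struct may differ; this one just groups runs of identical UTS 35 field symbols):

```rust
use std::borrow::Cow;

// Sketch: available_formats as (skeleton, pattern) tuples, as in this PR.
type AvailableFormats = Vec<(Cow<'static, str>, Cow<'static, str>)>;

// Hypothetical Skeleton: runs of identical field symbols, e.g. "yMMMd" -> ["y", "MMM", "d"].
#[derive(Debug, PartialEq)]
struct Skeleton(Vec<String>);

impl Skeleton {
    fn parse(key: &str) -> Skeleton {
        let mut runs: Vec<String> = Vec::new();
        for ch in key.chars() {
            match runs.last_mut() {
                // Extend the current run if the symbol repeats.
                Some(run) if run.ends_with(ch) => run.push(ch),
                // Otherwise start a new run.
                _ => runs.push(ch.to_string()),
            }
        }
        Skeleton(runs)
    }
}

fn main() {
    let formats: AvailableFormats =
        vec![(Cow::Borrowed("yMMMd"), Cow::Borrowed("MMM d, y"))];
    for (key, pattern) in &formats {
        let skeleton = Skeleton::parse(key);
        println!("{:?} -> {}", skeleton, pattern);
    }
}
```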

Member

Yeah, I think that for now we can keep the key as Cow<str>, and separately once we have more complete data structures we can experiment with shifting from stringified representations to serializing/deserializing structs and see what it does to perf/mem/size.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • Cargo.lock is now changed in the branch
  • components/datetime/Cargo.toml is now changed in the branch
  • components/datetime/src/provider/mod.rs is different
  • resources/testdata/data/json/dates/gregory@1/ar-EG.json is different
  • resources/testdata/data/json/dates/gregory@1/ar.json is different
  • resources/testdata/data/json/dates/gregory@1/bn.json is different
  • resources/testdata/data/json/dates/gregory@1/ccp.json is different
  • resources/testdata/data/json/dates/gregory@1/en-US-posix.json is different
  • resources/testdata/data/json/dates/gregory@1/en-ZA.json is different
  • resources/testdata/data/json/dates/gregory@1/en.json is different
  • resources/testdata/data/json/dates/gregory@1/es-AR.json is different
  • resources/testdata/data/json/dates/gregory@1/es.json is different
  • resources/testdata/data/json/dates/gregory@1/fr.json is different
  • resources/testdata/data/json/dates/gregory@1/ja.json is different
  • resources/testdata/data/json/dates/gregory@1/ru.json is different
  • resources/testdata/data/json/dates/gregory@1/sr-Cyrl.json is different
  • resources/testdata/data/json/dates/gregory@1/sr-Latn.json is different
  • resources/testdata/data/json/dates/gregory@1/sr.json is different
  • resources/testdata/data/json/dates/gregory@1/th.json is different
  • resources/testdata/data/json/dates/gregory@1/tr.json is different
  • resources/testdata/data/json/dates/gregory@1/und.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@gregtatum gregtatum marked this pull request as ready for review February 8, 2021 19:59
@gregtatum gregtatum requested review from sffc and a team as code owners February 8, 2021 19:59
Member

@sffc sffc left a comment

I would like to see the code using these resources before we add the resources to the data model.

pub style_patterns: StylePatternsV1,

#[cfg_attr(feature = "provider_serde", serde(with = "tuple_vec_map"))]
pub available_formats: Vec<(Cow<'static, str>, Cow<'static, str>)>,
Member

If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static. Alternatively, you can use TinyStr (for skeletons) or SmallStr (for patterns) and obviate the need for a lifetime parameter.

It looks like the skeletons are always, or almost always, 8 ASCII characters or shorter, in which case TinyStr8 would work very well. If you want to allow skeletons to be longer than that, consider TinyStrAuto.
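
The difference can be sketched like this (struct and field names are hypothetical, not from the PR): a lifetime parameter lets the data struct either borrow from a deserialization buffer or own its strings, instead of forcing `Cow<'static, str>`:

```rust
use std::borrow::Cow;

// Hypothetical data struct with a lifetime parameter.
struct AvailableFormatsV1<'a> {
    formats: Vec<(Cow<'a, str>, Cow<'a, str>)>,
}

impl<'a> AvailableFormatsV1<'a> {
    // Zero-copy construction: both strings are borrowed for lifetime 'a.
    fn borrowed(key: &'a str, pattern: &'a str) -> Self {
        AvailableFormatsV1 {
            formats: vec![(Cow::Borrowed(key), Cow::Borrowed(pattern))],
        }
    }
}

fn main() {
    let buffer = String::from("yMMMd"); // stand-in for a deserialized buffer
    let data = AvailableFormatsV1::borrowed(&buffer, "MMM d, y");
    assert!(matches!(data.formats[0].0, Cow::Borrowed(_)));

    // The same type can still hold fully owned ('static-compatible) data.
    let owned: AvailableFormatsV1<'static> = AvailableFormatsV1 {
        formats: vec![(Cow::Owned("Hm".to_string()), Cow::Owned("HH:mm".to_string()))],
    };
    assert!(matches!(owned.formats[0].1, Cow::Owned(_)));
    println!("ok");
}
```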

Member Author

If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static.

This is where a lot of my confusion over the memory model for data providers really stems. All of the other strings are using the 'static lifetime parameter, so I'm not sure if there is any difference between that usage, and this usage. I was honestly a bit surprised this code compiled.

    symbols!(months, [Cow<'static, str>; 12]);

    symbols!(weekdays, [Cow<'static, str>; 7]);

    symbols!(
        day_periods {
            am: Cow<'static, str>,
            pm: Cow<'static, str>,
            noon: Option<Cow<'static, str>>,
            midnight: Option<Cow<'static, str>>,
        }
    );

I investigated the source of the Cows, which appears to be #256. It seems like this was initially landed to get the data in tree, with the idea that we would optimize in the future. Similarly, I'm a bit nervous about spending a lot of time optimizing at the beginning, but let's start with a decent choice. I think there are different trade-offs in this data provider format, but I'd be happy to try the non-heap/inline solutions first, and then measure the memory afterwards.

Should I file an issue to add lifetime parameters to the Cow structures that are already here? If we're not adding new ones, it seems like the existing ones should be addressed at some point.

Member Author

@gregtatum gregtatum Feb 9, 2021

edit: This graph was wrong.

Member Author

@gregtatum gregtatum Feb 9, 2021

Re-posting this with correct utf8 byte size, instead of character count:

(image: UTF-8 byte size distribution)

Member Author

Per the style guide:

  • If lifetime parameters are allowed, use &'a str.
  • If lifetime parameters are not allowed, use one of:
    • TinyStr if the string is ASCII-only.
    • SmallString for shorter strings, with a stack size ∈ {8, 12, 16, 20} that fits at least 99% of cases.
    • Cow<'static, str> for longer strings.

The largest suggested stack size of 20 covers the following portions of this data set (aside: I'm not sure if the stack size in the style guide includes the bytes required by the SmallString struct itself beyond just the string data):

20 bytes: 96.5%
29 bytes: 98.8%
30 bytes: 99.5%

Which, according to the style guide, suggests we use Cow<'static, str>.
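
For reference, the kind of measurement behind those percentages can be sketched like this (the sample strings here are made up, not the real CLDR data); note it counts UTF-8 bytes, not characters:

```rust
// Sketch: what fraction of strings fit in a given inline (stack) capacity,
// measured in UTF-8 bytes via str::len().
fn percent_fitting(strings: &[&str], capacity: usize) -> f64 {
    let fitting = strings.iter().filter(|s| s.len() <= capacity).count();
    100.0 * fitting as f64 / strings.len() as f64
}

fn main() {
    // Made-up sample patterns for illustration only.
    let patterns = ["MMM d, y", "E, d 'de' MMMM 'de' y", "h:mm a", "y-MM-dd"];
    for cap in [8, 16, 20, 32] {
        println!("{:>2} bytes: {:.1}%", cap, percent_fitting(&patterns, cap));
    }
}
```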

Member

I need to update the style guide. As of my recent changes to the data provider trait, cows can and should have real lifetime parameters. I thought I updated all existing structs when I did that PR but I guess I missed this one hiding in a macro. You don't have to worry about migrating in this PR; I'll follow up in #257.

About data types: I'm talking more about skeletons than patterns. I believe that patterns are going to be long enough that you need a Cow. However, I think the keys (skeletons) might be a good case for TinyStr.

medium: other.medium.get_pattern().clone(),
short: other.short.get_pattern().clone(),
},
available_formats: other.available_formats.0.clone(),
Member

You don't have to copy the available_formats literally from CLDR into ICU4X. Is there any pre-processing you can do to optimize them?

Comment on lines 168 to 173
"MMMMW-count-zero": "الأسبوع W من MMMM",
"MMMMW-count-one": "الأسبوع W من MMMM",
"MMMMW-count-two": "الأسبوع W من MMMM",
"MMMMW-count-few": "الأسبوع W من MMMM",
"MMMMW-count-many": "الأسبوع W من MMMM",
"MMMMW-count-other": "الأسبوع W من MMMM",
Member

Do you have the plural rule selection implemented?

Member Author

No, I don't, and I haven't really thought through this problem. Do you have recommendations here? Currently I discard them in the Skeleton struct construction. I don't really have a mental model of how these will be used.

Member Author
The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm going to try to change the representation here to handle these. I see what's going on with them now.

Member

If you don't want to feature-creep your work, open an issue to figure these out later, and in the mean time, copy only the "other" form into the data.

@gregtatum
Member Author

Thanks for the review!

I would like to see the code using these resources before we add the resources to the data model.

That seems fair enough. I would like to work incrementally, so I may investigate ways to separate out different sections still.

@sffc
Member

sffc commented Feb 9, 2021

Thanks for the review!

I would like to see the code using these resources before we add the resources to the data model.

That seems fair enough. I would like to work incrementally, so I may investigate ways to separate out different sections still.

It's good to keep this as a separate PR; I'd just like to review the code PR first so I can get an idea of how this data is being used. I imagine that there may be additional changes you may want to make once we see how the data is used.

@codecov-io

codecov-io commented Feb 10, 2021

Codecov Report

Merging #480 (9ce6021) into main (30a3909) will decrease coverage by 0.13%.
The diff coverage is 68.55%.


@@            Coverage Diff             @@
##             main     #480      +/-   ##
==========================================
- Coverage   74.22%   74.08%   -0.14%     
==========================================
  Files         128      129       +1     
  Lines        7840     8294     +454     
==========================================
+ Hits         5819     6145     +326     
- Misses       2021     2149     +128     
Impacted files (coverage / Δ):
components/datetime/src/lib.rs 100.00% <ø> (ø)
components/datetime/src/fields/mod.rs 40.90% <40.00%> (+40.90%) ⬆️
components/datetime/src/provider/mod.rs 36.00% <53.57%> (+11.00%) ⬆️
components/datetime/src/pattern/mod.rs 66.11% <53.73%> (-15.02%) ⬇️
components/datetime/src/pattern/error.rs 30.76% <57.14%> (+30.76%) ⬆️
components/datetime/src/fields/length.rs 86.66% <66.66%> (+3.33%) ⬆️
components/provider_cldr/src/transform/dates.rs 72.79% <69.04%> (-0.39%) ⬇️
components/datetime/src/fields/symbols.rs 82.32% <71.79%> (+5.62%) ⬆️
components/datetime/src/provider/helpers.rs 58.94% <75.00%> (ø)
components/datetime/src/skeleton.rs 75.66% <75.66%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30a3909...9ce6021. Read the comment docs.

@gregtatum
Member Author

I updated the PR to address the review feedback so far. I still need to rebase my bigger work on top of these changes, so I may tweak the design a bit more; however, this incorporates all of the feedback received so far.

@@ -132,6 +132,9 @@ impl From<&cldr_json::StylePatterns> for gregory::patterns::StylePatternsV1 {

impl From<&cldr_json::DateTimeFormats> for gregory::patterns::DateTimeFormatsV1 {
fn from(other: &cldr_json::DateTimeFormats) -> Self {
use gregory::patterns::{CountV1, SkeletonPatternV1, SkeletonsV1};
Member

Instead of CountV1, it would be nice if we could use an actual standard plural enum from the plurals crate that has a well-defined string representation.

Member Author

I have a few thoughts here:

A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory directly, but this would add icu_plurals as a dependency to icu_datetime. Is this OK?

B: I'm a bit concerned with sharing serializable representations with code outside of the provider. The V1 signals a social contract to maintain the serialized representation. If I naively pulled in PluralCategory, then there is no signal that it's used as a serialization source, and that care should be taken when changing its representation. Perhaps this is fine here, since this is a fairly standardized representation. However, is there still a risk of this assumption changing? This type of thing was a source of bugs when I maintained the Firefox Profiler.

C: I read through data-pipeline.md again, but I'm still not 100% sure on the guarantees here. Here in the provider, every struct has an associated version.

  • How granular is this version number?
  • Does this mean that individual pieces of the serialization can change? For example, if I make a breaking change to CountV1, does every struct that contains a CountV1 need its version bumped?
  • Do we not care yet to define this breakdown, and label everything as V1 until something else is needed?
  • Can we have something with a V2 key, but contain V1 structs?
  • What happens when we have a breaking change?

Member

A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory directly, but this would add icu_plurals as a dependency to icu_datetime. Is this OK?

We could have it be an optional dependency and have skeletons be behind a feature (on by default, but this would allow you to build datetime with just styles for smaller footprint!)
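
In Cargo.toml terms, that suggestion could look roughly like this (a sketch; the feature name and version number here are assumptions, not from this PR):

```toml
# Hypothetical components/datetime/Cargo.toml fragment:
# icu_plurals as an optional dependency, pulled in only by a "skeletons"
# feature that is on by default but can be disabled for a smaller footprint.
[dependencies]
icu_plurals = { version = "0.1", optional = true }

[features]
default = ["skeletons"]
skeletons = ["icu_plurals"]
```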

Member

@sffc sffc Feb 11, 2021

I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?

In any case, I don't have any problem with adding this dependency because icu_datetime is higher-level than icu_plurals. It will also need to eventually gain a dependency on icu_numbers.

Member

@sffc sffc Feb 11, 2021

On your other questions:

B: This would assume PluralCategory is serializable as an enum. If it's not, then I would implement serde on that enum over in icu_plurals behind that crate's serde feature. Then the stability guarantee comes from PluralCategory's serde form.

C: Let's not worry about data struct versioning until ICU4X 1.0. These are going to be interesting questions to discuss and answer when we actually have to start worrying about our stability guarantees. Until that time, let's not worry about it much.

Member

I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?

What I'm saying is that an icu_datetime that only works with dateStyle/timeStyle is a usable date/time formatting API that may not need to pull in icu_plurals, am I correct? icu_plurals becomes important when skeletons come into play.

Member

Just re-reading the style guide on crate features:

https://github.com/unicode-org/icu4x/blob/master/docs/process/style_guide.md#when-to-add-crate-features--suggested

When adding enhancements to an ICU4X component, introduce features as a way for the end user to control the code size of their compilation as follows:

  1. If the enhancement adds a new crate dependency, it should be behind a feature.
  2. If the enhancement contains code that is not considered best practice, for example if it is mainly used for debugging diagnostics or development, then it should be behind a feature.

So it looks like this may fall under case 1. My hope is that dead code elimination would take care of this one, but I have no problem with putting the skeleton handling stuff behind a feature flag.

Member Author

Given the conversation here, I'm doubting whether we should store the string form of this data in memory at all. It would be fairly easy to implement Serialize and Deserialize for Field and PatternItem. Then the data could use something like the following shape:

pub type SkeletonTupleV1 = (
    SmallVec<[fields::Field; 5]>,
    SmallVec<[Vec<pattern::PatternItem>; 1]>,
);

Then all of the pattern matching machinery can take a lifetime parameter, and do all of its work via reference, without worrying about owning or copying any of this data. The data could be serialized back into the standard UTS 35 representations for strings, which is convenient and well-documented, but the in-memory representation would be exactly how we want to use it for the actual data processing.

The processing step can handle pre-sorting the information to make sure it's in the most efficient and direct form for skeleton matching. The skeleton matching machinery could then be a wrapper with a lifetime over this data representation.
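
A sketch of what that pre-parsed representation could look like (using plain Vec in place of SmallVec, and hypothetical enum variants; the real Field and PatternItem types differ):

```rust
// Hypothetical pre-parsed forms: instead of storing "MMM d" as a string,
// store the parsed items so no parsing happens at format time.
#[derive(Debug, Clone, PartialEq)]
enum PatternItem {
    Field { symbol: char, length: u8 }, // e.g. MMM -> ('M', 3)
    Literal(String),                    // e.g. " "
}

type Field = (char, u8);

// Plain Vec stands in for the SmallVec in the type above.
type SkeletonTupleV1 = (Vec<Field>, Vec<Vec<PatternItem>>);

// Serialize back to the UTS 35 string form for human-readable JSON.
fn pattern_to_string(items: &[PatternItem]) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            PatternItem::Field { symbol, length } => {
                out.extend(std::iter::repeat(*symbol).take(*length as usize))
            }
            PatternItem::Literal(s) => out.push_str(s),
        }
    }
    out
}

fn main() {
    let entry: SkeletonTupleV1 = (
        vec![('M', 3), ('d', 1)],
        vec![vec![
            PatternItem::Field { symbol: 'M', length: 3 },
            PatternItem::Literal(" ".into()),
            PatternItem::Field { symbol: 'd', length: 1 },
        ]],
    );
    assert_eq!(pattern_to_string(&entry.1[0]), "MMM d");
    println!("{:?}", entry);
}
```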

Member

I am very much in favor of pre-processing CLDR strings such that the data stored in ICU4X DataProvider needs minimal parsing and processing at runtime. I think it's okay if it incurs a small data size penalty as well.

// 9 16 0.1% 100.0%
// 10 8 0.0% 100.0%

// TODO - This could have better memory locality with TinyStr, however it does not
Member

zbraniecki/tinystr#32

Hopefully we can get that solved before you need to merge this PR.

components/provider_cldr/src/transform/dates.rs (outdated, resolved)
pub pattern: Cow<'static, str>,
#[cfg_attr(
feature = "provider_serde",
serde(skip_serializing_if = "Option::is_none")
Member

@sffc sffc Feb 10, 2021

You need to also make this dependent on the serialize_none feature, which is necessary for bincode. Find some boilerplate from elsewhere and copy it in here.

We should probably add a CI check that fails if bincode doesn't round-trip. #491

// currently implement Serialize or Deserialize traits.
//
// pub type SkeletonsV1 = Vec<(TinyStr16, SmallVec<[SkeletonPatternV1; 1]>)>,
pub type SkeletonsV1 = Vec<(Cow<'static, str>, SmallVec<[SkeletonPatternV1; 1]>)>;
Member

Suggestion: instead of using SmallVec<[SkeletonPatternV1; 1]>, consider an enum with two variants: a single pattern, and a TupleVecMap of multiple patterns keyed by plural form.
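
That suggestion could look roughly like this (names are hypothetical, and a Vec of tuples stands in for TupleVecMap):

```rust
// Hypothetical plural category, standing in for the one in icu_plurals.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PluralCategory { Zero, One, Two, Few, Many, Other }

type Pattern = String;

// The suggested enum: most skeletons map to exactly one pattern; the few
// plural-sensitive ones carry a small category -> pattern map.
#[derive(Debug)]
enum SkeletonPatterns {
    Single(Pattern),
    PluralVariants(Vec<(PluralCategory, Pattern)>),
}

impl SkeletonPatterns {
    fn select(&self, category: PluralCategory) -> &Pattern {
        match self {
            SkeletonPatterns::Single(p) => p,
            SkeletonPatterns::PluralVariants(map) => map
                .iter()
                .find(|(c, _)| *c == category)
                // Fall back to the "other" form, which CLDR always provides.
                .or_else(|| map.iter().find(|(c, _)| *c == PluralCategory::Other))
                .map(|(_, p)| p)
                .expect("data should always carry an 'other' form"),
        }
    }
}

fn main() {
    let hy_week = SkeletonPatterns::PluralVariants(vec![
        (PluralCategory::One, "MMMM W-ին շաբաթ".to_string()),
        (PluralCategory::Other, "MMMM W-րդ շաբաթ".to_string()),
    ]);
    println!("{}", hy_week.select(PluralCategory::Few));
}
```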

@sffc sffc added the waiting-on-author PRs waiting for action from the author for >7 days label Feb 18, 2021
@gregtatum
Member Author

It turns out the skeleton data in the CLDR has identical patterns for all of the different plural variants (except "other"). I'm thinking it will be better and simpler to land this code with the assumption that there is a 1:1 relationship between skeleton and pattern.

The only skeletons that have multiple patterns have at most one variant. This includes plural rules, where the count-one, count-two, etc. variants are identical; the only plural rule that differs is the count-other field.

Here is the list of patterns that have variants for a given skeleton.

locale skeleton pattern
en-CA Md MM-dd
en-CA Md-alt-variant d/M
en-CA MEd E, MM-dd
en-CA MEd-alt-variant E, d/M
en-CA MMdd MM-dd
en-CA MMdd-alt-variant dd/MM
en-CA yM y-MM
en-CA yM-alt-variant M/y
en-CA yMd y-MM-dd
en-CA yMd-alt-variant d/M/y
en-CA yMEd E, y-MM-dd
en-CA yMEd-alt-variant E, d/M/y
fil MMMMW-count-one 'ika'-W 'linggo' 'ng' MMMM
fil MMMMW-count-other 'linggo' W 'ng' MMMM
fil yw-count-one 'ika'-w 'linggo' 'ng' Y
fil yw-count-other 'linggo' w 'ng' Y
hy MMMMW-count-one MMMM W-ին շաբաթ
hy MMMMW-count-other MMMM W-րդ շաբաթ
hy yw-count-one Y թ․ w-ին շաբաթ
hy yw-count-other Y թ․ w-րդ շաբաթ
pcm yw-count-one 'Wik' w 'fọ' Y
pcm yw-count-other 'Wiik' w 'fọ' Y
ps MMMMW-count-one اونۍ‘ W د MMMM‘
ps MMMMW-count-other اونۍ W د MMMM
ps-PK MMMMW-count-one اونۍ‘ W د MMMM‘
ps-PK MMMMW-count-other اونۍ W د MMMM

This list is small enough that I don't think it's worth handling. In addition, we are not accepting free-form pattern input to match against, but rather specific options bags for the components, so I don't think there is an easy way to prefer one variant over the other. This will simplify the initial implementation, and we can revisit the decision if we actually do need a way to access these variants.

I determined this with this script.

I also did a spot check of the CLDR XML data to ensure that the JSON data was accurate, and it appears so. (I checked this locally more thoroughly.)

@sffc
Member

sffc commented Feb 22, 2021

This list is small enough, that I don't think it's worth handling.

The list being small shouldn't by itself justify low-level design decisions like this one if they reduce i18n quality. For example, it looks like the plural forms actually produce different results for "pcm".

In addition, we are not accepting free-form pattern input to match against, but rather specific options bags for the components. I don't think there is an easy way to prefer one variant over the other.

I'm also not sure what to do about variants that aren't tied to a Unicode extension keyword.

I know you want to check this in, but I'd like to discuss this question at the meeting, because I think it's an important decision that could serve as precedent for future scenarios. So can we talk about it on Thursday?

@gregtatum
Member Author

gregtatum commented Feb 23, 2021

The list being small shouldn't by itself justify low-level design decisions like this one if they reduce i18n quality.

First off, I agree that a list being small shouldn't be a reason to reduce the quality of the overall output. I'm focusing on delivering a correct solution, but also keeping an eye toward limiting the scope of this initial implementation.

My understanding was that the count-other was equivalent to alt-variant, but digging further, I realize that this is not the case. Looking into the plurals.json test data I see the following definition:

      "pcm": {
        "pluralRule-count-one": "i = 0 or n = 1 @integer 0, 1 @decimal 0.0~1.0, 0.00~0.04",
        "pluralRule-count-other": " @integer 2~17, 100, 1000, 10000, 100000, 1000000, … @decimal 1.1~2.6, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …"
      },

This leads me to realize that effectively for "pcm" the count-other is actually the plural variant. So I agree, this should be handled here, otherwise it would be a low quality result.
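
To make the "pcm" case concrete, here is a sketch of selecting between one/other for just that rule ("i = 0 or n = 1", where i is the integer part of n); this is the single rule hand-coded, not a general plural-rule evaluator:

```rust
#[derive(Debug, PartialEq)]
enum PluralCategory { One, Other }

// Hand-coded "pcm" cardinal rule: one when i = 0 or n = 1
// (i = the integer digits of n).
fn pcm_category(n: f64) -> PluralCategory {
    let i = n.trunc();
    if i == 0.0 || n == 1.0 {
        PluralCategory::One
    } else {
        PluralCategory::Other
    }
}

fn main() {
    for n in [0.0, 0.5, 1.0, 1.1, 2.0] {
        println!("{} -> {:?}", n, pcm_category(n));
    }
}
```

This matches the samples in the rule: 0, 1, and 0.0~1.0 select "one", while 2~17 and 1.1~2.6 select "other".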

I'm also not sure what to do about variants that aren't tied to a Unicode extension keyword.

I think for this first check-in, we should ignore them, and then file a follow-up to discuss them and come to a decision on them.

I know you want to check this in, but I'd like to discuss this question at the meeting, because I think it's an important decision that could serve as precedent for future scenarios. So can we talk about it on Thursday?

I'd be happy to discuss further if you feel the need, but I think we may be in agreement now that I realize my misunderstanding of the count-other case. In summary:

  • I will handle the -count-* cases
  • I will not handle the -alt-variant case, and file a follow-up issue to discuss.

@sffc
Member

sffc commented Feb 23, 2021

That sounds good. Alternatively, if you want to reduce the scope, I see that the plural forms are only used in the "week of year" and "month of year" patterns; you could just omit those patterns, since there is no components bag that triggers them, and make a note to follow up in #488.

@gregtatum
Member Author

I've updated this PR with the existing feedback. There are 4 new commits listed here. Locally, I rewrote my history while I was working, but used a merge strategy to update this PR. You can look at the total code diff, or look at it on a per-commit basis. This would be Part 1, and Part 2 of the new commits:

  • Merge in master
  • Part 1: Directly serialize patterns and skeletons in the CLDR and prov…
  • Part 2: Re-compute the testdata for skeletons
  • Merge branch 'components-bag' (early part) into available-formats

components/datetime/src/pattern/mod.rs (outdated, resolved)
components/datetime/src/pattern/mod.rs (resolved)
@gregtatum gregtatum requested a review from sffc March 18, 2021 16:36
@gregtatum
Member Author

I found a new issue. I didn't enforce sorting of the skeleton fields from the CLDR. This was fixed in:

1584020

Member

@sffc sffc left a comment

Nice work! A few nitpicks and suggestions.

components/datetime/src/skeleton.rs (outdated, resolved)
components/datetime/src/skeleton.rs (resolved)
components/datetime/src/skeleton.rs (outdated, resolved)
components/datetime/src/skeleton.rs (outdated, resolved)
components/datetime/src/provider/mod.rs (outdated, resolved)
Comment on lines 93 to 96
while let Some(item) = seq.next_element()? {
items.push(item)
}
Ok(Skeleton(items))
Member

Nitpick: We shouldn't assume that the fields are in order, because the bincode could come from an untrusted source. We should just validate that they are indeed in order (when reading a new item, check that it is greater than the previous item), and return a Serde error if they are not.

Member Author

Done.

Question: Do we know what kind of validation we need to do for binary sources and what kind of guarantees serde bincode provides?

Member Author

Actually, if we don't trust the source, then I'd also like to check for duplicate fields.

Member Author

Ok, this is done and with tests.

Member

Good question. IMO, our responsibility for validation is only on the level of semantic correctness. Bincode's own deserializer code should be able to handle correctness of types and such. We need to perform semantic correctness validation whenever we have our own Deserialize impl, or when deserializing into a struct that has certain invariants to maintain.

Member Author

I think I agree here. I messed around with the binary representation, and serde seemed to do the correct thing when given invalid binary data.

components/datetime/src/skeleton.rs (outdated, resolved)
components/datetime/src/fields/symbols.rs (resolved)
components/provider_cldr/src/transform/dates.rs (outdated, resolved)
Comment on lines +95 to +106
if prev_item > &item {
return Err(de::Error::invalid_value(
de::Unexpected::Other(&format!("field item out of order: {:?}", item)),
&"ordered field symbols representing a skeleton",
));
}
if prev_item == &item {
return Err(de::Error::invalid_value(
de::Unexpected::Other(&format!("duplicate field: {:?}", item)),
&"ordered field symbols representing a skeleton",
));
}
Member

Nitpick (optional): combine error handling via prev_item >= &item. Less code to read and compile for an error that shouldn't happen in normal situations.
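
The combined check might look like this (a sketch over a plain slice of comparable items, not the real Field type): a single `>=` comparison per item catches both out-of-order and duplicate fields.

```rust
// Sketch: validate that deserialized skeleton fields are strictly increasing
// (i.e. sorted and duplicate-free) with one comparison per adjacent pair.
fn validate_ordered<T: PartialOrd + std::fmt::Debug>(items: &[T]) -> Result<(), String> {
    for window in items.windows(2) {
        let (prev, item) = (&window[0], &window[1]);
        if prev >= item {
            return Err(format!(
                "field symbols out of order or duplicated: {:?} then {:?}",
                prev, item
            ));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_ordered(&['d', 'm', 'y']).is_ok());
    assert!(validate_ordered(&['m', 'd']).is_err()); // out of order
    assert!(validate_ordered(&['d', 'd']).is_err()); // duplicate
    println!("validation sketch ok");
}
```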

Member Author

Since everything is green, I may fast follow with this one.

Member

@sffc sffc left a comment

Switching my review status to Approval with one more optional nitpick
