Replace PluralElementsV1 with PluralElementsPackedULE #5510

sffc · 2024-09-06T00:56:12Z

Depends on #5511
~~Depends (no longer) on #5520~~

sffc · 2024-09-08T11:04:08Z

This can't land until #5511 and #5520 land, and the commit history is a little messy because of it, but I thought it might be good to give some context about how I plan to use the new zerovec abstractions.

This reduces the baked data size for the finely sliced data structs. It also makes PluralElementsPackedULE (with all the proper bounds and impls) which I intend to use in datetime 2.0.

sffc · 2024-09-08T19:11:27Z

components/plurals/src/provider.rs

+                // TODO: Inform the user more nicely that their data doesn't fit in our packed structure
+                panic!("other value too long to be packed: {self:?}")


One way I could get rid of the explicit panic would be to restrict the PatternElement's T to be a certain length (create a newtype such as ShortT).

Manishearth · 2024-09-09T03:14:14Z

It also makes PluralElementsPackedULE (with all the proper bounds and impls) which I intend to use in datetime 2.0.

I've said this before, I'm really not in favor of tying the 2.0 release timeline to brand new ULE ideas. I don't think we have the time to design that properly. I don't like designing these things without getting an understanding of the variety of use cases, and we can't get that understanding with such short notice.

I'm open to doing the MultiVar fix because it's one we've wanted for ages and have a plan for that that I think will be sufficiently general and reduce unsafe surface. VarTuple is rather straightforward as well and I'm fine with that, but I'm not really in favor of this set of zerovec changes.

sffc · 2024-09-09T03:24:18Z

Hm? This PR is the subject specific ULE for packed plural patterns. There are no zerovec changes in this PR.

This PR depends on #5520 which contains zerovec changes which I think are strongly motivated and justified independent of MultiVarULE, but I can change #5520 to be plurals-specific if it can land faster that way.

sffc · 2024-09-09T03:29:20Z

This PR is largely the result of our discussion on Friday where we decided to proceed with Option 3 for packed patterns. In order to do Option 3, I needed a better solution for packed plurals.

sffc · 2024-09-09T03:32:04Z

Also, please take a look at the fingerprint file in this PR. The savings for baked data are rather substantial. Generalized MultiFieldULE would get us somewhat close, but this is I feel a mostly optimal solution. I spent a lot of time designing this PR's ULE. TinyVarVar is a component I split out because I like to generalize when possible.

Manishearth · 2024-09-09T04:12:14Z

Hm? This PR is the subject specific ULE for packed plural patterns. There are no zerovec changes in this PR.

Yes, this PR requires #5520 which is a new ULE representation that hasn't been discussed.

This PR is largely the result of our discussion on Friday where we decided to proceed with Option 3 for packed patterns. In order to do Option 3, I needed a better solution for packed plurals.

My impression of our discussion on Friday was that we decided explicitly to not add more abstractions to zerovec and instead use MultiField (and/or do manual things for patterns), which we had decided to fast track for 2.0. This feels counter to that discussion?

(To be clear: I'm fine with #5511, it's #5520 that's a surprise to me)

I can change #5520 to be plurals-specific if it can land faster that way

Yes please. Generic unsafe code is a heavier burden on maintenance and while I like to generalize where possible too, I absolutely do not want to do that last minute. We have one use case for this generic type at the moment, that is not enough for us to know whether this abstraction is at the right level of specificity.

And as far as I can tell #5520 is just (the fixed version of) MultiFieldsULE with additional support for a u8 length format: that is something I've wanted to support as well and is easy to do. This is exactly what I'm getting at when talking about figuring out if something is sufficiently generic or specific.

Also, please take a look at the fingerprint file in this PR. The savings for baked data are rather substantial.

It seems like we have a path forward for now, but for future reference I'm really not happy with the dynamic where we design new ULE things last minute when we come up with a new way to save size. I've expressed this before: I'm fine doing it when we've had time to get a grasp of the variety of use cases, but that takes time and I don't think we have time to do this well here.

The way I see things are that the data design phase for a component (which should typically be many many months before stabilization) is when we may decide to invent stuff like this, and we can give those design processes the time they deserve in such a case. In the near-stabilization data optimization phase, we might tweak the representation to pack it further, but it is too late to invent new abstractions and at best we should do component-specific manual ULE impls (I'm not overly happy with those either but I think they're fine as we still figure things out).

So I basically don't find "the savings for this are substantial" to be a valid argument here. Unfortunately pattern data design got punted close to the end of 2.0 so there never was a separate initial data design phase, but I don't think that obviates us being careful here.

Generalized MultiFieldULE would get us somewhat close, but this is I feel a mostly optimal solution.

I feel like "somewhat close" is fine? I don't think we need to fully optimize this last minute like this. My preference is rather to keep unsafe abstractions low but be "somewhat close" on datasize than proliferate them.

Manishearth · 2024-09-09T04:13:19Z

think are strongly motivated and justified independent of MultiVarULE

FWIW, I'm really not seeing this so far, either the strong motivation or the justification independent of MultiFieldsULE. I think (an improved) MultiFieldsULE probably can cover this with no data size diff¹, and even if it does; it's unclear to me that that the relatively small wins there will be worth it.

We should probably prioritize the VarULE format refactors to be done sooner rather than later so we can get numbers.

It may be tricky to make the varint repr work via VarZeroVecFormat but that's not happening here anyway, it was a potential future optimization, and if the time comes for such an optimization we can design it then. ↩

sffc · 2024-09-09T04:43:25Z

Right, packed plural patterns has long been a priority, and going with Option 3 made datetime 2.0 depend on it (which may not have been clear in our discussion on Friday)

I implemented a packed plural solution here which I think is near optimal ("optimal" defined as being at least as good as what is would be with MFULE improvements). There is some but not a lot of unsafe code, and I made an effort for the unsafe code to be reasonably to review.

I'm trying to determine the paths forward:

Merge these PRs as-is, and perhaps migrate to MFULE later when the capabilities are there. Also potentially change StrStrPair to use TinyVarVar and remove MFULE from 2.0 data
Move TinyVarVar into the icu_plurals crate and treat it as internal to PatternElementsPackedULE
Rewrite this PR to use MFULE instead of TinyVarVar, and probably accelerate plans to improve MFULE

sffc · 2024-09-09T05:02:25Z

How about this: can you give specific suggestions on how to iterate on the data struct design?

I spent hours looking at this, and I kept reaching mental blocks (one reason I scheduled that meeting on Friday), and this PR is the what I've come up with. It satisfies several constraints:

Plural patterns should be encoded in-place, rather than at the same level as lengths and variants (Option 3)
Singleton plural patterns should not take up more space than they currently do in datetime: achieved by a single discriminant bit plus FourBitMetadata (which lets us share the existing plural datetime patterns metadata byte).
At the same time, optimize the stack representation of PluralElementsPacked to be no larger than 4 usizes, which was a 2024 blocker but not a 2.0 blocker: this PR achieves it

The layout of PluralElementsPackedULE is:

/// A bitpacked DST for [`PluralElements`].
///
/// Can be put in a [`Cow`] or a [`VarZeroVec`].
#[derive(Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct PluralElementsPackedULE<V: VarULE + ?Sized> {
    _v: PhantomData<V>,
    /// Invariant Representation:
    ///
    /// First byte: `d...mmmm`
    /// - `d` = 0 if singleton, 1 if a map
    /// - `...` = padding, should be 0
    /// - `mmmm` = [`FourBitMetadata`] for the default value
    ///
    /// Remainder: either [`PluralElementsSingletonVarULE`] or [`PluralElementsMultiVarULE`]
    /// based on the discriminant `d`
    bytes: [u8],
}

/// The type of bytes[1..] if there is a singleton pattern.
type PluralElementsSingletonVarULE<V> = V;

/// The type of bytes[1..] if there are special patterns.
type PluralElementsMultiVarULE<V> = TinyVarVarULE<V, PluralElementsTupleSliceVarULE<V>>;

/// The type of the special patterns list.
type PluralElementsTupleSliceVarULE<V> = VarZeroSlice<VarTupleULE<PluralCategoryAndMetadata, V>>;

I think I still satisfy my objectives if I were to switch PluralElementsMultiVarULE<V> to use MultiFieldVarULE, because that would only impact the cases where there is more than one plural form, which is not the common case.

sffc · 2024-09-09T05:14:10Z

Another design would be to merge TinyVarVarULE into the representation invariants of PluralElementsPackedULE

First byte: d...mmmm
d = 0 if singleton, 1 if a map
... = padding
mmmm = FourBitMetadata for the default value

If d is 0:
  Remainder: a single instance of V

If d is 1:
  Second byte: wwwwwwww = length of the first (default) V
  Next: an instance of V of length wwwwwwww
  Remainder: VarZeroSlice of plural specials

This is identical to the current byte representation, but without the objectionable TinyVarVarULE being split out.

Manishearth · 2024-09-09T10:09:19Z

Right, packed plural patterns has long been a priority, and going with Option 3 made datetime 2.0 depend on it (which may not have been clear in our discussion on Friday)

Yes, but we never designed the data for them early.

How about this: can you give specific suggestions on how to iterate on the data struct design?

To be clear: I'm fine with the underlying data model. I'm not fine with the introduction of TinyVarVar. I would support:

Doing a one-off manual VarULE type for the specific types in question (not generic), that lives in the component crate. I think this is what you suggest in your comment above?
"Rewrite this PR to use MFULE instead of TinyVarVar, and probably accelerate plans to improve MFULE"

So I think the data struct design is fine, I also think us allowing for some inefficiencies is fine, so a less efficient data model would also be okay, but I don't think that's actually a strong tradeoff we're facing at this moment if we fix MFULE.

sffc · 2024-09-09T16:58:03Z

I need PluralElementsPackedULE<V> to be generic over V because we already have cases for multiple types of V.

Manishearth · 2024-09-09T17:11:44Z

@sffc that's fine, it should be generic over one parameter then. Not ideal, but we're hoping to get past that anyway.

Ideally I'd suggest wrapping MultiFields and only having a little bit of unsafe, but doing a custom thing is fine as well.

Switching MultiFields over to having Format, and supporting Index8, are both very straightforward changes overall which we can easily do today if you need those wins. It's not all of the wins we need here but it covers some of them.

Manishearth · 2024-09-09T21:02:55Z

Since I'm hoping to get rid of the TinyVarVar implementation anyway and you've already done the work I'm also fine with the minimal change in this PR of moving that code into the plurals crate, making it private, and wrapping it to give you PluralElementsPackedULE<V>. If this ends up being a part of the release, that's also fine.

I wish to be very careful about additional unsafe code in zerovec that's going to make it for the 2.0 release, since I'm gearing up to get unsafe review for that crate and that's going to take quite a while.

robertbastian

What I'd like to see in this PR:

struct PluralElementsV1<'data> {
   // your custom encoding
}

// reverted, i.e. a wrapper around `PluralElementsV1` until we make it generic
struct PluralPatterns<'data, B: PatternBackend> {
   pub strings: PluralElementsV1<'data>,
   pub _phantom: PhantomData<B>,
}

What we can then do in a subsequent PR:

struct PluralElementsV1<'data, T: VarULE> {
   // your custom encoding
}

type PluralPatterns<'data, PatternBackend> =
        PluralElementsV1<'data, Pattern<B, B::Store>>;

sffc · 2024-09-10T17:24:21Z

struct PluralElementsV1<'data> {
   // your custom encoding
}

With a hard-coded str or with V?

robertbastian · 2024-09-10T17:26:05Z

For now with a hardcoded str, making it generic and removing PluralPatterns can be a followup. It's actually quite involved (I just tried it at head) because Pattern is missing a lot of implementations (Deref, ToOwned, even VarULE).

sffc · 2024-09-10T17:27:11Z

ok, coming right up.

I want to name it PluralElementsPackedCowStr

sffc · 2024-09-10T17:32:48Z

type PluralPatterns<'data, PatternBackend> =
        PluralElementsV1<'data, Pattern<B, B::Store>>;

It sounds like you agree, but I don't plan to do this in this PR

sffc · 2024-09-10T17:57:01Z

Done. There are now a total of only 6 lines changed in icu_experimental and they are all either changing imports or changing PluralElements<T> so that it doesn't always borrow T. This in effect means that my PR is fully internal to icu_plurals from the perspective of icu_experimental. I do not seek to scope-creep this PR into icu_experimental.

robertbastian · 2024-09-10T17:53:34Z

components/plurals/src/provider.rs

@@ -368,115 +375,705 @@ pub enum PluralElementsKeysV1 {
 // TODO: Make the str generic
 pub struct PluralElementsFieldV1<'data>(pub PluralElementsKeysV1, pub Cow<'data, str>);


I think this is dead

robertbastian · 2024-09-10T17:54:08Z

components/plurals/src/provider.rs

-#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))]
-#[yoke(prove_covariance_manually)]
-/// A data-struct-optimized representation of [`PluralElements`].
-pub struct PluralElementsV1<'data> {


nit: move the new struct to the top above all the implementation details

This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.

My mental model of file structure for structs like struct A(B, C), struct B(D) is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.

robertbastian · 2024-09-10T17:55:17Z

components/plurals/src/provider.rs

+where
+    T: PartialEq,
+{
+    fn get_specials_tuples(&self) -> impl Iterator<Item = (PluralElementsKeysV1, &T)> {


nit: this has a single call site, which would be more readable if this wasn't defined on another type

I disagree that the call site would be more readable. It is the job of PluralElements to provide an iterator over the non-empty values it contains. I would consider making this a public function of PluralElements, but I didn't want to scope creep, so I left it in the provider module.

robertbastian · 2024-09-10T17:58:03Z

components/experimental/src/dimension/provider/extended_currency.rs

@@ -43,5 +43,5 @@ pub struct CurrencyExtendedDataV1<'data> {
    ///    Regards to the [Unicode Report TR35](https://unicode.org/reports/tr35/tr35-numbers.html#Currencies),
    ///    If no matching for specific count, the `other` count will be used.
    #[cfg_attr(feature = "serde", serde(borrow))]
-    pub display_names: PluralElementsV1<'data>,
+    pub display_names: PluralElementsPackedCowStr<'data>,


not a fan of calling this PackedCowStr. I think PluralElementsV1 was fine as it's our first iteration of PluralElements. The user here really doesn't need to care about encoding details.

We don't typically call things V1 unless they are actually data structs. I let it land with this name initially because it's unstable, but I consider it an improvement to call the type by what it actually is.

robertbastian · 2024-09-10T18:00:44Z

components/plurals/src/provider.rs

+}
+
+/// Helper function to access a value from [`PluralElementsTupleSliceVarULE`]
+fn get_special<V: VarULE + ?Sized>(


nit: this can be in get

I see this as an associated function on type PluralElementsTupleSliceVarULE<V>, but I can't add associated functions to typedefs. I also stylistically prefer to keep functions at the module level because there is less nesting / indentation.

But, I have the sense that you feel more strongly about this style (fully encapsulating private functions and structs when possible) than I do, so I'll move it for you.

Actually, I ended up calling this function in one more place (the new serde impl) so I'm leaving it where it is.

robertbastian · 2024-09-10T18:03:44Z

components/plurals/src/provider.rs

+    V: VarULE + PartialEq + ?Sized,
+{
+    #[cfg(feature = "datagen")]
+    fn to_plural_elements(&self) -> PluralElements<(FourBitMetadata, &V)> {


this is also only used by a single call site, serde::Serialize. I'd prefer if that defined its own local struct with a derive(Serialize), so the PluralElements doesn't have to expose Serialize (it's not great having a public type be Serialize if it doesn't work for non-human readable formats).

Technically PluralElements does serialize to non-human-readable formats, it's just not optimized for it. But, we should discourage that. I'm convinced; I'll make a private struct for the serialization.

robertbastian · 2024-09-10T18:05:06Z

components/plurals/src/provider.rs

+///
+/// See <https://github.com/unicode-org/icu4x/pull/1556>
+#[cfg(feature = "serde")]
+pub fn deserialize_plural_elements_packed_cow<'de, 'data, D, V>(


this can be private, if you're using a cow you might as well use the PluralElementsPackedCowStr struct

I left it public because if you want to deserialize something other than a str, you'll need it. I consider it part of the value proposition of PatternElementsPackedULE.

I managed to make PluralElementsPackedCow be generic, so now I am able to make the fn be private.

robertbastian · 2024-09-10T18:43:02Z

provider/data/experimental/fingerprints.csv

observation: postcard size is slightly increasing, baked is decreasing

-currency/extended@1, en/ESA, 108B, 62B, a051b393cf885669 -currency/extended@1, en/ESB, 128B, 82B, 610579242f4b65f9 -currency/extended@1, en/ESP, 84B, 38B, 34b7b7dccf3b5fd2 +currency/extended@1, en/ESA, 86B, 63B, d05fac7f1be65e18 +currency/extended@1, en/ESB, 106B, 83B, 91a6ef580bbe95f3 +currency/extended@1, en/ESP, 62B, 39B, 2e4f2ff2266524df

There's no postcard size difference for data structs that have only the other plural. I guess where there are specials, the bytes get one byte bigger because there's a new discriminant bit (pulling something that was previously in postcard land into zerovec land).

Previously:

// No Specials [postcard length] [other pattern] [postcard length 0] // With Specials [postcard length] [other pattern] [postcard length] [specials]

Now:

// No Specials [postcard length] [discriminant] [other pattern] // With Specials [postcard length] [discriminant] [other length] [other pattern] [specials]

sffc · 2024-09-10T23:47:22Z

components/plurals/src/provider.rs

+}
+
+// Need a manual impl because the derive(Clone) impl bounds are wrong
+impl<'data, V> Clone for PluralElementsPackedCow<'data, V>


rust-lang/rust#26925

sffc · 2024-09-11T00:33:13Z

I think I've gone above and beyond on this PR. Ready to merge.

Note: Some of the CI jobs are being killed with exit code 143, but re-running fixes them.

robertbastian · 2024-09-11T00:00:50Z

components/plurals/src/provider.rs

-#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))]
-#[yoke(prove_covariance_manually)]
-/// A data-struct-optimized representation of [`PluralElements`].
-pub struct PluralElementsV1<'data> {


This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.

My mental model of file structure for structs like struct A(B, C), struct B(D) is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.

sffc force-pushed the packedpattern branch from cfac26d to 1f5b134 Compare September 8, 2024 08:19

sffc changed the title ~~Start sketching out PackedSkeletonDataV2~~ Add PatternElementsPackedULE and use it Sep 8, 2024

sffc requested review from Manishearth and robertbastian September 8, 2024 11:01

sffc marked this pull request as ready for review September 8, 2024 11:01

sffc requested review from zbraniecki, pdogr and a team as code owners September 8, 2024 11:01

This was referenced Sep 8, 2024

Add TinyVarVar and TinyVarVarULE #5520

Closed

Start sketching out PackedSkeletonDataV2 #5521

Draft

sffc force-pushed the packedpattern branch from d32711b to 90c8b82 Compare September 8, 2024 19:06

sffc commented Sep 8, 2024

View reviewed changes

sffc added 3 commits September 9, 2024 13:40

Add TinyVarVar

1f151e0

Fix typo bug on main

79582ad

Add PluralElementsPackedULE and use it in relativetime

0f7223b

sffc force-pushed the packedpattern branch from 90c8b82 to 0f7223b Compare September 9, 2024 20:40

sffc changed the title ~~Add PatternElementsPackedULE and use it~~ Replace PluralElementsV1 with PatternElementsPackedULE Sep 10, 2024

get_specials_tuples

effaef0

robertbastian reviewed Sep 10, 2024

View reviewed changes

sffc added 4 commits September 10, 2024 10:46

add PluralElementsPackedCowStr

687cf68

Revert all changes in the experimental crate

5a3101f

Small changes to experimental crate call sites

c33063c

cargo make bakeddata experimental

594863b

doc, fmt, missing_apis

1b055db

robertbastian reviewed Sep 10, 2024

View reviewed changes

robertbastian changed the title ~~Replace PluralElementsV1 with PatternElementsPackedULE~~ Replace PluralElementsV1 with PluralElementsPackedULE Sep 10, 2024

robertbastian reviewed Sep 10, 2024

View reviewed changes

sffc added 3 commits September 10, 2024 13:46

Don't expose serde impl on PluralElements

7714c3b

rm PluralElementsField

e858f66

Move struct definitions closer together

382c846

sffc requested a review from robertbastian September 10, 2024 21:00

sffc added 3 commits September 10, 2024 14:08

clippy

f8c10e5

Make PluralElementsPackedCow generic

9670ab8

cargo make bakeddata experimental

3b307df

sffc commented Sep 10, 2024

View reviewed changes

Make the fn private now that PluralElementsPackedCow is generic

55c6d25

robertbastian approved these changes Sep 11, 2024

View reviewed changes

robertbastian merged commit c0e2dbc into unicode-org:main Sep 11, 2024
28 checks passed

sffc deleted the packedpattern branch September 11, 2024 15:25

sffc mentioned this pull request Sep 18, 2024

Use Pattern more directly #5531

Merged

		// TODO: Inform the user more nicely that their data doesn't fit in our packed structure
		panic!("other value too long to be packed: {self:?}")

		@@ -368,115 +375,705 @@ pub enum PluralElementsKeysV1 {
		// TODO: Make the str generic
		pub struct PluralElementsFieldV1<'data>(pub PluralElementsKeysV1, pub Cow<'data, str>);

Replace PluralElementsV1 with PluralElementsPackedULE #5510

Replace PluralElementsV1 with PluralElementsPackedULE #5510

Conversation

sffc commented Sep 6, 2024 • edited Loading

sffc commented Sep 8, 2024

Choose a reason for hiding this comment

Manishearth commented Sep 9, 2024

sffc commented Sep 9, 2024

sffc commented Sep 9, 2024

sffc commented Sep 9, 2024

Manishearth commented Sep 9, 2024

Manishearth commented Sep 9, 2024 • edited Loading

Footnotes

sffc commented Sep 9, 2024

sffc commented Sep 9, 2024

sffc commented Sep 9, 2024 • edited Loading

Manishearth commented Sep 9, 2024

sffc commented Sep 9, 2024

Manishearth commented Sep 9, 2024

Manishearth commented Sep 9, 2024

robertbastian left a comment

Choose a reason for hiding this comment

sffc commented Sep 10, 2024

robertbastian commented Sep 10, 2024

sffc commented Sep 10, 2024

sffc commented Sep 10, 2024 • edited Loading

sffc commented Sep 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc Sep 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc Sep 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sffc commented Sep 11, 2024

Choose a reason for hiding this comment

sffc commented Sep 6, 2024 •

edited

Loading

Manishearth commented Sep 9, 2024 •

edited

Loading

sffc commented Sep 9, 2024 •

edited

Loading

sffc commented Sep 10, 2024 •

edited

Loading

sffc Sep 10, 2024 •

edited

Loading

sffc Sep 10, 2024 •

edited

Loading