Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exotic Types as fields of data provider structs #523

Closed
sffc opened this issue Mar 3, 2021 · 3 comments · Fixed by #699
Closed

Exotic Types as fields of data provider structs #523

sffc opened this issue Mar 3, 2021 · 3 comments · Fixed by #699
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-tiny Size: Less than an hour (trivial fixes) T-docs-tests Type: Code change outside core library
Milestone

Comments

@sffc
Copy link
Member

sffc commented Mar 3, 2021

In #480, @gregtatum is proposing that we put an exotic type (in this case Pattern) in the data struct definition, and I expect @echeran will do something similar in enumerated properties. This is novel in the data provider architecture, and I want to discuss the implications.

Definition: When I say "exotic type", I mean an opaque type defined by ICU4X that implements Serde in a non-obvious way. I do not consider TinyStr and LiteMap to be exotic types, because they are serialization-compatible with String and HashMap.

Advantages of Exotic Types

When we put exotic types in data structs, it means that the serialized form is more opaque to the reader of the code. If we used only simple types in the data structs, it makes it easier to reason about the lifecycle of data structs.

However, as Greg pointed out in #480, putting exotic types in the data provider means that deserialization into those exotic types happens during data loading rather than in the constructor, which is a nice advantage that is consistent with the "pre-processing of data" design goal I outlined in data_provider.md.

On the other hand, we could design data structs to contain "low-level" representations of ICU4X concepts, like the list of tokens in a Pattern rather than the CLDR syntax for a Pattern. This would resolve the "pre-processing of data" issue, but it may tie our hands a bit more by encoding the internal representation of types into the data struct.

Stability

As a reminder, the reason we have versioned data keys and data structs (like plurals/ordinal@1 and PluralRulesV1) is so that a data file from a newer ICU4X can power code from an older ICU4X (and vice-versa as much as possible), as discussed in data_pipeline.md.

We need to think about what implications exotic types have on data struct stability. For example, if you have a struct like FooV1 { bar: Bar }, then the serialized form of Bar must also be stable.

Here are some implications to start with:

  1. If a field of a data struct is an enum, then we need to make sure the serialized form of the enum is stable: both the string representation (for JSON) and presumably the discriminant (integer) version as well.
  2. I don't think #[non_exhaustive] can work for fields of data structs, because if a new variant gets added to an enum, for example, a newer data file, which might have the new enum variant, won't be able to power an older ICU4X client, which doesn't know how to interpret that variant.

Portability

Currently, the Rust structs are the source of truth for the schema of the serialized files. However, I want us to keep the door open to using a cross-platform declarative syntax like JSON Schema or YANG to define the data struct schemas.

In other words, the serialized form of exotic types should be rooted in the building blocks of modern data markup languages: strings, numbers, arrays, and maps, with possible restrictions on those types (e.g., a string that is 1 character long, or a number that is in a certain range).

Portability is also an argument in favor of making the deserialization step as simple as possible. Going back to the Pattern exotic type, if I wanted to implement a data provider in JavaScript that serves a data struct across the FFI boundary, I would have to implement CLDR pattern parsing code.

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting C-data-infra Component: provider, datagen, fallback, adapters labels Mar 3, 2021
@sffc
Copy link
Member Author

sffc commented Mar 9, 2021

Notes from 2021-03-09:

  • @gregtatum: My expectation is that the human-readable CLDR pattern strings are used in JSON, and the array of fields is used in Bincode.
  • @sffc: Deserialization needs to be plumbed to support Bincode having a different form than JSON
  • @zbraniecki: We could use a flag to enable or disable the pattern parsing code
  • @sffc: The flag could enable or disable JSON
  • @Manishearth: If we make these structs #[repr(C)], then everything inside of them needs to be #[repr(C)], or else we could make mirror structs for FFI that are convertible to and from the Rust representation. Mirror structs could be useful to make it harder to accidentally change the FFI representation.
  • @sffc: How do we enforce serialization compatibility?
  • @Manishearth: We could have a test that runs on older versions of serialized data and ensures that the data can be deserialized from that form, and serialized back to the same form.
  • @sffc: We can also introduce a JSON Schema for the data structs, which will flag any problems if the serialized form changes.

Decision: Exotic types in data structs are OK. Add some documentation in the style guide explaining when to use them.

@sffc
Copy link
Member Author

sffc commented Mar 26, 2021

After #539 is implemented, I would like to spend some time profiling code size. I have trouble seeing how the compiler could know that it can eliminate the is_human_readable code path without additional help, especially when erased_serde is being used. Since we're now using erased_serde over FFI, we need a story for how we eliminate the JSON deserialization code path when building with erased_serde.

@sffc
Copy link
Member Author

sffc commented Mar 27, 2021

In documenting this, also discuss expectations for validation of user input, which will be necessary in Serde impls of exotic types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-tiny Size: Less than an hour (trivial fixes) T-docs-tests Type: Code change outside core library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant