Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing datagen options by DatagenDriver #3861

Merged
merged 27 commits into from
Aug 17, 2023

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Aug 14, 2023

The new architecture has two main structs: DatagenProvider and DatagenDriver. SourceData is deprecated and put behind the legacy_api flag, data is now added directly to the DatagenProvider. DatagenDriver is used to configure export options, and contains the export logic such as locale selection. These two structs could in the future live in different crates.

Misc fixes:

  • Made JSON syntax more idiomatic
  • Made JSON configs self-normalizing
  • Moved segmentation data from data into tests/data. This is required as the new API does not include fallback data, so the segmentation dictionaries have to be in the correct location inside tests/data/icuexport. The legacy API thus pulls in data from tests/data, but it does so using include_str! which I hope is fine.

Fixes #3795, #3800

Part of #3564

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick architectural review; mostly looks OK but leaving Requested Changes for some smaller things

provider/blob/tests/data/config.json Show resolved Hide resolved
provider/datagen/src/transform/cldr/characters/mod.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Show resolved Hide resolved
provider/datagen/src/source.rs Outdated Show resolved Hide resolved
components/calendar/data/config.json Show resolved Hide resolved
@robertbastian robertbastian changed the title Replacing datagen options by DataExportDriver Replacing datagen options by DatagenDriver Aug 15, 2023
/// are excluded. This method can be used to reennable them.
///
/// The special string `"search*"` causes all search collation tables to be included.
pub fn with_collations(self, collations: impl IntoIterator<Item = String>) -> Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called with_addtional_collations or something? Because most collations are preselected and cannot be removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question and I like your suggestion; let's not block this pr on the resolution but we should decide before 1.3

provider/datagen/src/source.rs Outdated Show resolved Hide resolved
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments

I'd like to see this merged soon; I don't like big PRs like this open for a long time.

provider/datagen/src/bin/datagen/args.rs Outdated Show resolved Hide resolved
provider/datagen/src/bin/datagen/config.rs Show resolved Hide resolved
sffc
sffc previously approved these changes Aug 15, 2023
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: this is a great refactoring and I really like the new API. I have some nitty comments but the overall architecture looks great!

/// [`icu_datagen::all_keys`]: crate::all_keys
/// [`icu_datagen::key`]: crate::key
/// [`icu_datagen::keys_from_bin`]: crate::keys_from_bin
pub fn with_keys(self, keys: impl IntoIterator<Item = DataKey>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about impl IntoIterator -- if we need the vec or hashset then why not just take the vec or hashset? It's nice that it keeps the internal collection hidden, but if you already have a hashset then this is slower. (ok to discuss in a follow up)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this to be much more ergonomic. In tests I can just pass arrays, and also keys, keys_for_bin and keys_from_file return Vec<DataKey>, so you always end up writing .into_iter().collect() at the call site.

/// are excluded. This method can be used to reennable them.
///
/// The special string `"search*"` causes all search collation tables to be included.
pub fn with_collations(self, collations: impl IntoIterator<Item = String>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question and I like your suggestion; let's not block this pr on the resolution but we should decide before 1.3

provider/datagen/src/driver.rs Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Show resolved Hide resolved
provider/datagen/src/transform/cldr/source.rs Show resolved Hide resolved
provider/datagen/tests/test-options.rs Show resolved Hide resolved
provider/testdata/src/lib.rs Show resolved Hide resolved
Manishearth
Manishearth previously approved these changes Aug 15, 2023
@robertbastian robertbastian dismissed stale reviews from Manishearth and sffc via 4110c15 August 15, 2023 20:20
sffc
sffc previously approved these changes Aug 16, 2023
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really should either change the bounds on the export function to be stable (by loading the likely subtags data some roundabout way via ExportMarker) or make the function be named export_unstable but I'm okay doing that after landing the mammoth PR so long as it is tracked to block 1.3

@robertbastian
Copy link
Member Author

We should load it through ExportMarker somehow, but that type is such a mess I'll try that in a follow-up.

@@ -147,12 +138,7 @@ impl DatagenDriver {
// Avoids multiple monomorphizations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: only sink gets a trait object; is it worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk, but provider is not object safe so it can't

@robertbastian robertbastian merged commit ea1ba9f into unicode-org:main Aug 17, 2023
25 checks passed
@robertbastian robertbastian deleted the driver branch August 17, 2023 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should Datagen Options be Copy?
3 participants