New datagen API #3386

robertbastian · 2023-04-24T15:06:34Z

Supersedes #3142

This gets rid of a lot of the current API surface (it's kept around as doc(hidden) for semver). Instead of calling a top-level datagen function with Out parameters, clients now construct a DatagenProvider with SourceData and Options, and then call export on that. export accepts an impl DataExporter. This will make it easier to exclude exporter crates (icu_provider_fs, icu_provider_blob, future icu_provider_bake), as the new API can be built without them (#3365).
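A minimal sketch of that call shape, for readers who want to see it in code. Every type here is a hypothetical stand-in written for illustration; the real DatagenProvider, SourceData, Options and DataExporter in icu_datagen and icu_provider have different fields and signatures.

```rust
use std::collections::BTreeMap;

// All types below are hypothetical stand-ins, not the real icu_datagen API.

/// Stand-in for the exporter trait accepted by `export`.
pub trait DataExporter {
    fn put(&mut self, key: &str, locale: &str, payload: &str) -> Result<(), String>;
}

/// Stand-in source: (key, locale, payload) records instead of CLDR/ICU files.
pub struct SourceData {
    pub records: Vec<(&'static str, &'static str, &'static str)>,
}

/// Stand-in options: only a list of selected locales.
pub struct Options {
    pub locales: Vec<&'static str>,
}

pub struct DatagenProvider {
    source: SourceData,
    options: Options,
}

impl DatagenProvider {
    pub fn new(source: SourceData, options: Options) -> Self {
        Self { source, options }
    }

    /// Writes every selected record to `exporter` and returns how many were written.
    pub fn export(&self, exporter: &mut impl DataExporter) -> Result<usize, String> {
        let mut written = 0;
        for (key, locale, payload) in &self.source.records {
            if self.options.locales.contains(locale) {
                exporter.put(key, locale, payload)?;
                written += 1;
            }
        }
        Ok(written)
    }
}

/// A toy exporter that keeps everything in memory.
#[derive(Default)]
pub struct MemoryExporter {
    pub out: BTreeMap<(String, String), String>,
}

impl DataExporter for MemoryExporter {
    fn put(&mut self, key: &str, locale: &str, payload: &str) -> Result<(), String> {
        self.out
            .insert((key.to_string(), locale.to_string()), payload.to_string());
        Ok(())
    }
}

/// Usage: build the provider once, then hand it any exporter.
pub fn demo() -> usize {
    let provider = DatagenProvider::new(
        SourceData {
            records: vec![
                ("hello@1", "en", "Hello"),
                ("hello@1", "de", "Hallo"),
                ("hello@1", "fr", "Bonjour"),
            ],
        },
        Options { locales: vec!["en", "de"] },
    );
    let mut exporter = MemoryExporter::default();
    provider
        .export(&mut exporter)
        .expect("the in-memory exporter never fails")
}
```

Because export only needs something that implements the exporter trait, the crate that defines the provider does not have to depend on any concrete exporter, which is the decoupling the description is after.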

The new API doesn't support multiple exporters at once. The reason is that it's a very niche use case, it can be worked around by defining a forking data exporter, and it's still available through the old API. make-testdata still uses the old API, whereas icu4x-datagen uses the new one; this way we keep full coverage. When we remove the old API, we can move the forking exporter from datagen to make-testdata.
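The forking workaround could look roughly like the sketch below. The trait and all type names are hypothetical stand-ins for illustration, not the actual exporter trait or the forking exporter that lives in datagen.

```rust
// Hypothetical stand-in for the exporter trait; the real one differs.
pub trait DataExporter {
    fn put(&mut self, key: &str, locale: &str, payload: &str) -> Result<(), String>;
}

/// Forwards every record to two inner exporters, so a single `export`
/// call can feed both. Stops at the first error.
pub struct ForkingExporter<A, B>(pub A, pub B);

impl<A: DataExporter, B: DataExporter> DataExporter for ForkingExporter<A, B> {
    fn put(&mut self, key: &str, locale: &str, payload: &str) -> Result<(), String> {
        self.0.put(key, locale, payload)?;
        self.1.put(key, locale, payload)
    }
}

/// Toy exporter that only counts records.
#[derive(Default)]
pub struct CountingExporter {
    pub count: usize,
}

impl DataExporter for CountingExporter {
    fn put(&mut self, _key: &str, _locale: &str, _payload: &str) -> Result<(), String> {
        self.count += 1;
        Ok(())
    }
}

/// Toy exporter that records one line per record.
#[derive(Default)]
pub struct CollectingExporter {
    pub lines: Vec<String>,
}

impl DataExporter for CollectingExporter {
    fn put(&mut self, key: &str, locale: &str, payload: &str) -> Result<(), String> {
        self.lines.push(format!("{key}/{locale}={payload}"));
        Ok(())
    }
}
```

Since ForkingExporter itself implements the trait, forks can be nested to reach more than two targets without the provider knowing about it.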

#3365

#3564

Manishearth · 2023-04-24T15:39:19Z

This is going to take some time to land so it's probably fine but I'd like to request we don't land this too soon (not this week, at least). @sffc has previously expressed that he considers this repo to still be in "patch 1.2" state so we shouldn't land major changes that we're not okay with having sneak out through a patch release.

(Plus I haven't finished the Google3 datagen import yet. Hopefully this week.)

robertbastian · 2023-04-24T15:52:41Z

Sure, I'd appreciate feedback already though.

Manishearth · 2023-04-24T16:13:51Z

Oh absolutely, I plan to go through this sooner than that 😄

jira-pull-request-webhook · 2023-04-25T14:49:29Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Manishearth

partial review

provider/datagen/Cargo.toml

provider/datagen/src/bin/datagen/mod.rs

provider/datagen/src/lib.rs

Manishearth

reviewed the whole thing. No blocking comments I'd say

Manishearth · 2023-04-27T00:40:33Z

provider/datagen/src/lib.rs

-    pub use icu_provider::KeyedDataMarker;
+    pub use icu_provider::{datagen::DataExporter, DataKey, KeyedDataMarker};
+
+    // SEMVER GRAVEYARD


💀 💀 💀 💀

provider/datagen/src/bin/datagen/mod.rs

provider/datagen/src/bin/datagen/config.rs

provider/datagen/src/source.rs

Manishearth · 2023-04-28T14:05:08Z

I am now comfortable landing this. capi may still have a patch release to fix the ssize_t issue if Makoto needs it, but I think they use ICU4X from GitHub.

sffc

Like the design overall. Much cleaner. Mostly small nits with a couple discussion questions

ffi/diplomat/Cargo.toml

provider/blob/src/export/mod.rs

provider/datagen/Cargo.toml

provider/datagen/src/baked_exporter.rs

provider/datagen/src/bin/datagen/args.rs

provider/datagen/src/lib.rs

sffc · 2023-04-30T07:37:54Z

provider/datagen/src/transform/cldr/displaynames/language.rs

-            })
-            .map(DataLocale::from)
-            .collect())
+        Ok(self.source.options.locales.filter_by_langid_equality(


Discussion/Issue: I'm not really a fan of pulling this into each and every transformer. It seems very much prone to error.

I think I still prefer the design where the list of locales (or locale config) is provided directly to the export function instead of being in the datagen constructor.

robertbastian · 2023-05-02

If export decided which locales to export, any key-specific logic (like for segmentation) would need to be in export. I think that's bad design because it would literally be an if key == Marker::KEY { // choose locales } for any number of keys, whereas with this design the logic for each key is in its provider implementation.

robertbastian · 2023-05-02

Whether the locale config is provided to the constructor or export is independent of whether the actual logic is in export or the providers. Putting it in the constructor is just easier because passing it to supported_locales would require changing the public IterableDataProvider trait (or using internal mutability on self before calling it). I'm somewhat neutral as to where to provide it (I think export makes sense as well), but I think evaluating it in the provider impls is much cleaner.

sffc · 2023-05-03

This seems not quite right to me but it's fine to land for now, and we can discuss further. #3409
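To make the trade-off in this thread concrete, here is a rough sketch of the "logic lives in each provider" side of the argument. All names are hypothetical and simplified; in particular, the real supported_locales on IterableDataProvider takes no options argument, which is exactly why the PR puts the options in the constructor. The segmenter stand-in ignores the user selection only to illustrate key-specific behaviour; the PR itself keeps filtering there for now.

```rust
/// Hypothetical stand-in for the user-selected locale configuration.
pub struct Options {
    pub locales: Vec<&'static str>,
}

/// Hypothetical per-key provider interface; each key decides for itself
/// how the user selection applies.
pub trait KeyProvider {
    fn key(&self) -> &'static str;
    fn supported_locales(&self, options: &Options) -> Vec<&'static str>;
}

/// Ordinary key: keeps only locales that the user selected.
pub struct DisplayNames {
    pub available: Vec<&'static str>,
}

impl KeyProvider for DisplayNames {
    fn key(&self) -> &'static str {
        "displaynames"
    }
    fn supported_locales(&self, options: &Options) -> Vec<&'static str> {
        self.available
            .iter()
            .copied()
            .filter(|l| options.locales.contains(l))
            .collect()
    }
}

/// Special-cased key: its entries act more like script selectors, so this
/// illustrative version returns all of them regardless of the selection.
pub struct SegmenterModels {
    pub available: Vec<&'static str>,
}

impl KeyProvider for SegmenterModels {
    fn key(&self) -> &'static str {
        "segmenter"
    }
    fn supported_locales(&self, _options: &Options) -> Vec<&'static str> {
        self.available.clone()
    }
}

/// The export side stays generic: no `if key == ...` branches are needed.
pub fn plan_export(
    providers: &[&dyn KeyProvider],
    options: &Options,
) -> Vec<(&'static str, &'static str)> {
    providers
        .iter()
        .flat_map(|p| {
            p.supported_locales(options)
                .into_iter()
                .map(move |l| (p.key(), l))
        })
        .collect()
}
```

The alternative design would move the two supported_locales bodies into plan_export behind a check on the key, which is the per-key branching the reply objects to; the cost of the design shown is that every provider has to remember to apply the filter.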

sffc · 2023-04-30T07:39:07Z

provider/datagen/src/transform/segmenter/mod.rs

+        // TODO: Do we actually want to filter these by the user-selected locales? The keys
+        // are more like script selectors...


Yes, this is the big case where we likely want to return data for locales that aren't necessarily in the user-specified set.

We probably should have an additional LSTM-specific flag to allow you to choose which models to build (for which languages, grapheme/codepoint, training fidelity, etc).

robertbastian · 2023-05-02

I guess I'll keep this in for now to preserve behaviour?

Do we have an issue for segmenter datagen locale selection? I couldn't find an open one.

sffc · 2023-05-03

Created #3408

Manishearth

r=me, want shane's approval too

sffc

I approve of the change, but I encourage you next time to put more priority on making it easy for the code reviewer.

sffc · 2023-05-05T19:28:35Z

provider/datagen/src/bin/datagen/mod.rs

+use simple_logger::SimpleLogger;
+
+mod args;
+pub mod config;


Nit: A bit confusing to have a pub mod in a bin target

robertbastian · 2023-05-08

Well, it's pub mod to the parent, which doesn't re-export it; we also do this in lib crates for private modules.

provider/datagen/src/transform/segmenter/dictionary.rs

sffc · 2023-05-05T21:19:09Z

provider/datagen/src/transform/segmenter/mod.rs

+            }
+        }
+
+        fn get_grapheme_segmenter_value_from_name(name: &str) -> GraphemeClusterBreak {


Issue: This is a complicated PR, and it is more difficult to review when you refactor code in a way that is unrelated to the change. It appears that you changed this function from being a module function to being an inline function. It would be a much smaller change if you would add cfgs to the module functions if that's all you need to make the code compile without warnings.

I didn't check if you have this in a standalone commit; please use more detailed commit messages if you prefer a commit-by-commit review.

robertbastian · 2023-05-08

This is the most recent commit, called seg cleanup. I would have preferred to make this a standalone PR, but with the cross-timezone review cycle, having chains is a massive pain.

sffc · 2023-05-08

OK thanks for the pointer; everything looks mergeable now except for the seg cleanup commit, which requires more of my time, so I will look at it in more detail this afternoon.

If your goal is to get things reviewed faster, please don't scope-creep them in the middle of a review cycle.

Manishearth · 2023-05-08

@robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "

@sffc I think given that this is a large PR already we should tend towards merging sooner rather than later and deal with the leftover stuff as post-review.

robertbastian · 2023-05-08

> @robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "

This has not worked well for me in the past: I would get approval, but then it has merge conflicts, I lose approval, and it takes another couple of days to get approval again.

Manishearth · 2023-05-08

I find rebases to be better than merges for dealing with that. But yeah that can be a pain.

sffc · 2023-05-08

My preferred solution is to make a new PR as Draft, and once the parent PR is merged, fix up the child Draft PR and then send it out for review. Avoid PR chains longer than 2 or 3 by working on some other component that is independent; the project is big enough that there are lots of options.

robertbastian requested review from sffc, Manishearth and a team as code owners April 24, 2023 15:06

robertbastian force-pushed the dg branch from 2564167 to 5cd6faf April 25, 2023 14:49

squash

115554b

robertbastian force-pushed the dg branch from 5cd6faf to 115554b April 26, 2023 09:56

This comment was marked as spam.

Manishearth reviewed Apr 26, 2023

provider/datagen/Cargo.toml (outdated)

provider/datagen/src/bin/datagen/mod.rs

provider/datagen/src/lib.rs

Manishearth reviewed Apr 27, 2023

robertbastian added 4 commits April 28, 2023 11:12

fix+comments

f2bcf19

fmt

1e4fce3

fix

43d5d81

fi

b79c05d

sffc requested changes Apr 30, 2023

robertbastian added 4 commits May 2, 2023 12:55

fix

8281da9

cargo

0286995

Merge branch 'main' into dg

c887668

deserialize_json

de195d2

robertbastian requested review from sffc and Manishearth May 2, 2023 14:54

sffc mentioned this pull request May 3, 2023

Where should locale filtering take place in the data exporter? #3409

Closed

sffc previously approved these changes May 3, 2023

Manishearth previously approved these changes May 3, 2023

Merge branch 'main' into dg

49aad77

robertbastian dismissed Manishearth’s stale review via 49aad77 May 3, 2023 18:32

robertbastian dismissed sffc’s stale review via 49aad77 May 3, 2023 18:32

a

da73744

robertbastian requested a review from sffc May 3, 2023 18:51

robertbastian added 2 commits May 3, 2023 20:52

todo

ae5f538

gen

7746a51

robertbastian requested a review from Manishearth May 4, 2023 08:27

doc

f357dbf

Manishearth previously approved these changes May 4, 2023

seg cleanup

6773515

robertbastian dismissed Manishearth’s stale review via 6773515 May 5, 2023 10:33

sffc reviewed May 5, 2023

Update provider/datagen/src/transform/segmenter/dictionary.rs

d66f90c

Co-authored-by: Shane F. Carr <shane@unicode.org>

robertbastian requested a review from sffc May 8, 2023 09:30

robertbastian mentioned this pull request May 8, 2023

Add some Debug implementations in datagen #3418

Merged

Manishearth previously approved these changes May 8, 2023

Merge branch 'main' into dg

b6346b3

robertbastian dismissed Manishearth’s stale review via b6346b3 May 8, 2023 22:26

Manishearth approved these changes May 8, 2023

sffc approved these changes May 9, 2023

robertbastian merged commit b0890a7 into unicode-org:main May 9, 2023

robertbastian deleted the dg branch May 9, 2023 12:30

robertbastian mentioned this pull request May 10, 2023

Make modes of icu4x-datagen be optional features #3365

Closed
