Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New datagen API #3386

Merged
merged 17 commits into from
May 9, 2023
Merged

New datagen API #3386

merged 17 commits into from
May 9, 2023

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Apr 24, 2023

Supersedes #3142

This gets rid of a lot of the current API surface (it's kept around doc(hidden) for semver). Instead of calling a top-level datagen function with Out parameters, clients now construct a DatagenProvider with SourceData and Options, and then call export on that. export accepts an impl DataExporter. This will make it easier to exclude exporter crates (icu_provider_fs, icu_provider_blob, future icu_provider_bake), as the new API can be built without them (#3365).

The new API doesn't support multiple exporters at once. The reason for that is that it's a very niche use case, it can be worked around by defining a forking data exporter, and it's still available through the old API. make-testdata still uses the old API, whereas icu4x-datagen uses the new one, this way we keep full coverage. When we remove the old API, we can move the forking exporter from datagen to make-testdata.

#3365

#3564

@robertbastian robertbastian requested review from sffc, Manishearth and a team as code owners April 24, 2023 15:06
@Manishearth
Copy link
Member

This is going to take some time to land so it's probably fine but I'd like to request we don't land this too soon (not this week, at least). @sffc has previously expressed that he considers this repo to still be in "patch 1.2" state so we shouldn't land major changes that we're not okay with having sneak out through a patch release.

(Plus I haven't finished the Google3 datagen import yet. Hopefully this week.)

@robertbastian
Copy link
Member Author

Sure, I'd appreciate feedback already though.

@Manishearth
Copy link
Member

Oh absolutely, I plan to go through this sooner than that 😄

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

This comment was marked as spam.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review

provider/datagen/Cargo.toml Outdated Show resolved Hide resolved
provider/datagen/src/bin/datagen/mod.rs Show resolved Hide resolved
provider/datagen/src/lib.rs Show resolved Hide resolved
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed the whole thing. No blocking comments I'd say

pub use icu_provider::KeyedDataMarker;
pub use icu_provider::{datagen::DataExporter, DataKey, KeyedDataMarker};

// SEMVER GRAVEYARD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💀 💀 💀 💀

provider/datagen/src/bin/datagen/mod.rs Show resolved Hide resolved
provider/datagen/src/bin/datagen/config.rs Outdated Show resolved Hide resolved
provider/datagen/src/source.rs Show resolved Hide resolved
@Manishearth
Copy link
Member

I am now comfortable landing this. capi may still have a patch release to fix the ssize_t issue if Makoto needs, but I think they use ICU4X from Github.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the design overall. Much cleaner. Mostly small nits with a couple discussion questions

ffi/diplomat/Cargo.toml Outdated Show resolved Hide resolved
provider/blob/src/export/mod.rs Show resolved Hide resolved
provider/datagen/Cargo.toml Outdated Show resolved Hide resolved
provider/datagen/src/baked_exporter.rs Outdated Show resolved Hide resolved
provider/datagen/src/baked_exporter.rs Show resolved Hide resolved
provider/datagen/src/bin/datagen/args.rs Outdated Show resolved Hide resolved
provider/datagen/src/bin/datagen/args.rs Show resolved Hide resolved
provider/datagen/src/lib.rs Show resolved Hide resolved
})
.map(DataLocale::from)
.collect())
Ok(self.source.options.locales.filter_by_langid_equality(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion/Issue: I'm not really a fan of pulling this into each and every transformer. It seems very much prone to error.

I think I still prefer the design where the list of locales (or locale config) is provided directly to the export function instead of being in the datagen constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If export decided which locales to export, any key-specific logic (like for segmentation) would need to be in export. I think that's bad design because it would literally be an if key == Marker::KEY { // choose locales } for any number of keys, whereas with this design the logic for each key is in its provider implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether the locale config is provided to the constructor or export is independent of whether the actual logic is in export or the providers. Putting it in the constructor is just easier because passing it to supported_locales would require changing the public IterableDataProvider trait (or using internal mutability on self before calling it). I'm somewhat neutral as to where to provide it (I think export makes sense as well), but I think evaluating it in the provider impls is much cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not quite right to me but it's fine to land for now, and we can discuss further. #3409

Comment on lines 829 to 830
// TODO: Do we actually want to filter these by the user-selected locales? The keys
// are more like script selectors...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the big case where we likely want to return data for locales that aren't necesarily in the user specified set.

We probably should have an additional LSTM-specific flag to allow you to choose which models to build (for which languages, grapheme/codepoint, training fidelity, etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'll keep this in for now to preserve behaviour?

Do we an issue for segmenter datagen locale selection? I couldn't find an open one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #3408

sffc
sffc previously approved these changes May 3, 2023
Manishearth
Manishearth previously approved these changes May 3, 2023
@robertbastian robertbastian requested a review from sffc May 3, 2023 18:51
Manishearth
Manishearth previously approved these changes May 4, 2023
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, want shane's approval too

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of the change, but I encourage you next time to put more priority on making it easy for the code reviewer.

use simple_logger::SimpleLogger;

mod args;
pub mod config;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: A bit confusing to have a pub mod in a bin target

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's pub mod to the parent which doesn't reexport it, which we also do in lib crates for private modules.

provider/datagen/src/transform/segmenter/dictionary.rs Outdated Show resolved Hide resolved
}
}

fn get_grapheme_segmenter_value_from_name(name: &str) -> GraphemeClusterBreak {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This is a complicated PR, and it is more difficult to review when you refactor code in a way that is unrelated to the change. It appears that you changed this function from being a module function to being an inline function. It would be a much smaller change if you would add cfgs to the module functions if that's all you need to make the code compile without warnings.

I didn't check if you have this in a standalone commit; please use more detailed commit messages if you prefer a commit-by-commit review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most recent commit called seg cleanup. I would have preferred to make this a standalone PR, but with the cross-timezone review cycle having chains is a massive pain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks for the pointer; everything looks mergeable now except for the seg cleanup commit which requires more of my time so I will look at in more detail this afternoon.

If your goal is to get things reviewed faster, please don't scope-creep them in the middle of a review cycle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "

@sffc I think given that this is a large PR already we should tend towards merging sooner rather than later and deal with the leftover stuff as post-review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "

This has not worked well for me in the past, I would get approval it but it has merge conflicts, I lose approval and it takes another couple of days to get approval again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find rebases to be better than merges for dealing with that. But yeah that can be a pain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preferred solution is to make a new PR as Draft, and once the parent PR is merged, fix up the child Draft PR and then send it out for review. Avoid PR chains longer than 2 or 3 by working on some other component that is independent; the project is big enough that there are lots of options.

Co-authored-by: Shane F. Carr <shane@unicode.org>
Manishearth
Manishearth previously approved these changes May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants