Adding segmenter model option to datagen #3669

Merged: 3 commits, Jul 22, 2023

Member

@robertbastian robertbastian commented Jul 12, 2023

Fixes #3408

Explicitly removing the cjdict from testdata, as it's a 10MB JSON file.

Member

@sffc sffc left a comment

Praise: Clean solution

Comment on lines +59 to +60
"thaidict".into(),
"Thai_codepoints_exclusive_model4_heavy".into(),
Member

Issue: It's definitely a breaking change to remove cjdict from icu_testdata. I agree it would be nice to get rid of the JSON file but it's more important that it stays in the postcard file we ship. Let's at least split that to its own PR that we can discuss and not hold up this PR on it.

Member Author

This is the non-shipping testdata. Which reminds me, I should update the other testdata script.
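
For readers following the thread, here is a minimal sketch of what a "segmenter model option" could look like. The enum `ModelSelection`, its variants, and the helper `testdata_selection` are hypothetical names invented for illustration; the actual definition in provider/datagen/src/options.rs may differ. Only the two model names are taken from the reviewed lines.

```rust
/// Hypothetical option type, for illustration only; not the real datagen API.
#[derive(Debug, Clone, PartialEq)]
pub enum ModelSelection {
    /// Include every model the generator knows about.
    All,
    /// Include only the explicitly named models.
    Explicit(Vec<String>),
    /// Include no models at all.
    None,
}

impl ModelSelection {
    /// Returns true if the named model should be generated.
    pub fn includes(&self, model: &str) -> bool {
        match self {
            ModelSelection::All => true,
            ModelSelection::Explicit(names) => names.iter().any(|n| n == model),
            ModelSelection::None => false,
        }
    }
}

/// A testdata-style selection using the two names visible in the reviewed
/// lines. "cjdict" is deliberately absent, mirroring the PR description.
pub fn testdata_selection() -> ModelSelection {
    ModelSelection::Explicit(vec![
        "thaidict".into(),
        "Thai_codepoints_exclusive_model4_heavy".into(),
    ])
}
```

With a selection like this, leaving a model out of non-shipping testdata is a matter of omitting its name from the list, which is the kind of change being debated above.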

provider/datagen/src/options.rs (resolved)
@robertbastian robertbastian requested a review from sffc July 13, 2023 13:22
@Manishearth Manishearth removed their request for review July 13, 2023 18:58
@robertbastian robertbastian merged commit 10ab02f into unicode-org:main Jul 22, 2023
23 checks passed
Member

sffc commented Jul 22, 2023

This broke main CI

[cargo-make] INFO - Running Task: testdata-check
[cargo-make] ERROR - Error while running duckscript: Source: Unknown Line: 13 - 

Test data needs to be updated. Please run `cargo make download-repo-sources`, `cargo make testdata` and `cargo make testdata-hello-world`:

?? provider/datagen/tests/data/json/segmenter/dictionary/w_auto@1/
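
The `??` prefix in the last log line is how git's short status format marks an untracked path. The sketch below shows one way a check could pick such paths out of `git status --porcelain` output. It assumes the check works by regenerating the data and then inspecting the working tree, which the log suggests but does not confirm. The function names `untracked_paths` and `is_clean` are hypothetical and are not part of the repository's tooling.

```rust
/// Returns the paths that `git status --porcelain` reports as untracked,
/// i.e. the lines that start with "?? ".
pub fn untracked_paths(porcelain: &str) -> Vec<&str> {
    porcelain
        .lines()
        .filter_map(|line| line.strip_prefix("?? "))
        .collect()
}

/// Returns true when the status output lists no changes at all.
/// Assumption: a data check passes only if regeneration leaves the tree clean.
pub fn is_clean(porcelain: &str) -> bool {
    porcelain.lines().all(|line| line.trim().is_empty())
}
```

Applied to the log above, the untracked `w_auto@1/` directory would be reported, which is consistent with newly generated data that was never committed.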

@sffc sffc mentioned this pull request Jul 23, 2023
@robertbastian robertbastian deleted the segopts branch August 8, 2023 10:21

Successfully merging this pull request may close these issues.

Tweak how locales are generated in icu_segmenter datagen
2 participants