Apply new structure to public Dictionary/LSTM data markers #3267

Merged (1 commit) on Apr 6, 2023

@sffc sffc commented Apr 5, 2023

Fixes #2905

There are now three keys:

"segmenter/lstm/wl_auto@1",
"segmenter/dictionary/w_auto@1",
"segmenter/dictionary/wl_ext@1",

These keys reflect the following facts:

  1. Complex breaking is needed for CJK word break but not for CJK line break (note: line break support may be added in the future along with the Taiwanese phrase break)
  2. There are currently no locales where the dictionary is required for line break when LSTM is available

The keys can be easily extended in the future if more locales are added to either LSTM or dictionary.
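
As a rough illustration of how the three keys could map onto segmenter configurations, here is a minimal sketch. The enums and the `keys_needed` function are hypothetical and are not part of the ICU4X API. The sketch also assumes that `w`/`wl` mean "word" / "word and line" and that `auto`/`ext` mean "loaded by default" / "opt-in extended data"; this is one reading of the key names, not something the PR states.

```rust
/// Hypothetical: which segmenter the data is being loaded for.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SegmenterKind {
    Word,
    Line,
}

/// Hypothetical: which model handles complex (non-CJK) scripts.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComplexModel {
    Lstm,
    Dictionary,
}

// The three key strings listed in the PR description.
pub const LSTM_WL_AUTO: &str = "segmenter/lstm/wl_auto@1";
pub const DICT_W_AUTO: &str = "segmenter/dictionary/w_auto@1";
pub const DICT_WL_EXT: &str = "segmenter/dictionary/wl_ext@1";

/// Returns the keys a given configuration would need under the assumed
/// reading of the key names.
pub fn keys_needed(kind: SegmenterKind, model: ComplexModel) -> Vec<&'static str> {
    let mut keys = Vec::new();
    // Fact 1: CJK complex breaking is needed for word break only.
    if kind == SegmenterKind::Word {
        keys.push(DICT_W_AUTO);
    }
    // Fact 2: the extended dictionary is only needed when LSTM is not used.
    match model {
        ComplexModel::Lstm => keys.push(LSTM_WL_AUTO),
        ComplexModel::Dictionary => keys.push(DICT_WL_EXT),
    }
    keys
}
```

Under these assumptions, a line segmenter using LSTM needs only the LSTM key, while a word segmenter additionally needs the CJK dictionary key.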

- Splits dictionary data into two keys
Dictionary {
cj,
..Default::default()
}
}

pub(crate) fn load_chinese_japanese<
Member

nit: you can keep a single load function and make it generic in the marker.

Member Author

Sure; will fix in a follow up
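
For readers unfamiliar with the pattern in the nit above, a single load function that is generic in the marker might look roughly like the sketch below. Everything here is a hypothetical stand-in: `DictionaryMarker`, the two marker structs, and `ToyProvider` are simplified illustrations, not the actual ICU4X data provider traits or the code in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a data marker: ties a type to a key string.
pub trait DictionaryMarker {
    const KEY: &'static str;
}

/// Hypothetical marker for the auto-loaded word-break dictionary.
pub struct WordAutoMarker;
/// Hypothetical marker for the extended word/line dictionary.
pub struct WordLineExtMarker;

impl DictionaryMarker for WordAutoMarker {
    const KEY: &'static str = "segmenter/dictionary/w_auto@1";
}

impl DictionaryMarker for WordLineExtMarker {
    const KEY: &'static str = "segmenter/dictionary/wl_ext@1";
}

/// Toy provider that maps key strings to placeholder payload bytes.
pub struct ToyProvider {
    data: HashMap<&'static str, Vec<u8>>,
}

impl ToyProvider {
    pub fn new() -> Self {
        let mut data = HashMap::new();
        data.insert(WordAutoMarker::KEY, b"cj".to_vec());
        data.insert(WordLineExtMarker::KEY, b"ext".to_vec());
        ToyProvider { data }
    }

    /// One load function for every marker: the key comes from the type
    /// parameter, so no per-dictionary function is needed.
    pub fn load<M: DictionaryMarker>(&self) -> Option<&[u8]> {
        self.data.get(M::KEY).map(Vec::as_slice)
    }
}
```

A caller would then pick the data by type, for example `provider.load::<WordAutoMarker>()`, instead of calling a separate function per dictionary.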

@sffc sffc merged commit 6871b65 into unicode-org:main Apr 6, 2023
@sffc sffc deleted the segmenter-data-split branch April 6, 2023 16:14
Development

Successfully merging this pull request may close these issues.

Support datagen with dictionary for some locales and LSTM for others