Apply new structure to public Dictionary/LSTM data markers #3267

sffc · 2023-04-05T23:33:51Z

There are now three keys:

"segmenter/lstm/wl_auto@1",
"segmenter/dictionary/w_auto@1",
"segmenter/dictionary/wl_ext@1",

These keys reflect the following facts:

- Complex breaking is needed for CJK word break but not for line break (note: it may be added in the future with the Taiwanese phrase break).
- There are currently no locales where the dictionary is required for line break if LSTM is available.

The keys can be easily extended in the future if more locales are added to either LSTM or dictionary.

- Splits dictionary data into two keys
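
To illustrate how the three key strings are structured, here is a minimal sketch that splits a key into its parts. `ParsedKey`, `parse_key`, and `covers_line_break` are hypothetical names for this example, not ICU4X APIs. Reading `w` as "word break only" and `wl` as "word and line break" is an assumption based on the facts listed above; the PR does not spell out the naming scheme.

```rust
/// Hypothetical parsed form of a key such as "segmenter/dictionary/w_auto@1".
/// This is an illustration only, not a type from ICU4X.
#[derive(Debug, PartialEq)]
struct ParsedKey<'a> {
    model: &'a str,   // "lstm" or "dictionary"
    scope: &'a str,   // "w" or "wl"
    variant: &'a str, // "auto" or "ext"
    version: u32,
}

/// Splits "segmenter/<model>/<scope>_<variant>@<version>" into its parts.
/// Returns None if the string does not have that shape.
fn parse_key(key: &str) -> Option<ParsedKey<'_>> {
    let (path, version) = key.rsplit_once('@')?;
    let version: u32 = version.parse().ok()?;
    let mut parts = path.split('/');
    if parts.next()? != "segmenter" {
        return None;
    }
    let model = parts.next()?;
    let name = parts.next()?;
    if parts.next().is_some() {
        return None;
    }
    let (scope, variant) = name.split_once('_')?;
    Some(ParsedKey { model, scope, variant, version })
}

impl ParsedKey<'_> {
    /// Assumption: an 'l' in the scope means the data is also used for line break.
    fn covers_line_break(&self) -> bool {
        self.scope.contains('l')
    }
}

fn main() {
    let keys = [
        "segmenter/lstm/wl_auto@1",
        "segmenter/dictionary/w_auto@1",
        "segmenter/dictionary/wl_ext@1",
    ];
    for key in keys {
        if let Some(k) = parse_key(key) {
            println!(
                "{} v{}: model={}, variant={}, line break={}",
                key, k.version, k.model, k.variant, k.covers_line_break()
            );
        }
    }
}
```

Under this reading, adding locales to either model means adding or widening keys rather than changing the naming scheme.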

robertbastian · 2023-04-06T09:36:38Z

experimental/segmenter/src/complex.rs

        Dictionary {
            cj,
            ..Default::default()
        }
    }

+    pub(crate) fn load_chinese_japanese<


nit: you can keep a single load function and make it generic in the marker.

Sure; will fix in a follow-up.
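
The suggestion of a single load function that is generic in the marker could look roughly like the sketch below. It is self-contained and does not use the real ICU4X provider traits: `DictionaryMarker`, `WordAutoMarker`, `WordLineExtMarker`, `Provider`, and `sample_provider` are all hypothetical stand-ins, and the payload bytes are placeholders.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a data marker: a zero-sized type that names a key.
trait DictionaryMarker {
    const KEY: &'static str;
}

/// Hypothetical markers for the two dictionary keys from this PR.
struct WordAutoMarker;
struct WordLineExtMarker;

impl DictionaryMarker for WordAutoMarker {
    const KEY: &'static str = "segmenter/dictionary/w_auto@1";
}

impl DictionaryMarker for WordLineExtMarker {
    const KEY: &'static str = "segmenter/dictionary/wl_ext@1";
}

/// Hypothetical provider that maps key strings to raw payload bytes.
struct Provider {
    payloads: HashMap<&'static str, Vec<u8>>,
}

impl Provider {
    /// One load function for every dictionary marker: the key comes from
    /// the type parameter, so no per-dictionary function is needed.
    fn load<M: DictionaryMarker>(&self) -> Option<&[u8]> {
        self.payloads.get(M::KEY).map(|v| v.as_slice())
    }
}

/// Builds a provider with placeholder payloads (not real dictionary data).
fn sample_provider() -> Provider {
    let mut payloads = HashMap::new();
    payloads.insert(WordAutoMarker::KEY, b"cjk-placeholder".to_vec());
    payloads.insert(WordLineExtMarker::KEY, b"ext-placeholder".to_vec());
    Provider { payloads }
}

fn main() {
    let provider = sample_provider();
    // The caller picks the dictionary by naming the marker type.
    let cjk = provider.load::<WordAutoMarker>();
    let ext = provider.load::<WordLineExtMarker>();
    println!("w_auto loaded: {}", cjk.is_some());
    println!("wl_ext loaded: {}", ext.is_some());
}
```

The design choice is that the marker type, not the function name, selects the data, so a new dictionary key needs only a new marker implementation.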

Apply new structure to public Dictionary/LSTM data markers

9740604

- Splits dictionary data into two keys

sffc requested review from Manishearth, robertbastian, aethanyc, makotokato and a team as code owners April 5, 2023 23:33

sffc removed request for a team, makotokato and Manishearth April 5, 2023 23:34

Manishearth approved these changes Apr 6, 2023


robertbastian approved these changes Apr 6, 2023


sffc merged commit 6871b65 into unicode-org:main Apr 6, 2023

sffc deleted the segmenter-data-split branch April 6, 2023 16:14

sffc mentioned this pull request Apr 6, 2023

Consolidate functions in complex.rs #3274

Closed
