Support datagen with dictionary for some locales and LSTM for others #2905

sffc · 2022-12-20T15:29:07Z

Currently we support km, lo, my, and th for LSTM, and those four plus a single unified CJK for Dictionary. Until we have an ML model working for CJK, users may wish to use LSTM for the four SEA languages and Dictionary for CJK. However, datagen does not currently have the ability to perform this type of filtering in a single go.

A few ways to support this:

Add key-specific locale filtering
Make datagen aware of the segmentation data with a special flag (similar to what we do for collations)
Add more general logic to support locales in one key if available and fallback to another key otherwise, and utilize it for the segmenter

Also, datagen doesn't know whether users have try_new_lstm + try_new_dictionary or try_new_automatic at their call sites. I lean toward making this automatic segmenter filtering in datagen the default behavior (drop dictionaries when LSTM is available), but that may cause unexpected behavior when try_new_dictionary is directly invoked.

The text was updated successfully, but these errors were encountered:

sffc · 2023-01-05T19:24:59Z

Discussion:

@sffc - I prefer this being a dedicated datagen flag for segmenter. Behavior: If both LSTM and Dictionary keys are requested, then use a heuristic to include exactly one per language. If LSTM is requested without Dictionary, unconditionally include LSTM for all locales. If Dictionary is requested without LSTM, unconditionally include Dictionary for all locales. There is a flag like --segmenter-include to tweak this behavior.
@younies - But this causes problems if one part of the code requires dictionary and another part requires LSTM?
@sffc - Yes. I think we can justify that by saying that it's already the case that Dictionary or LSTM may not be defined in all languages.
@Manishearth - Note that we're getting more and more configurable in datagen. There is a point where we should think about these holistically.

sffc · 2023-01-25T00:19:27Z

Another snag I remembered: there are multiple versions of the LSTM model that trade size for accuracy.

sffc · 2023-01-25T00:24:46Z

I think we should default to the "medium accuracy" model in datagen. In order to retain the uniqueness invariant of resource paths, this can be an "auxiliary key", as proposed for currency in #1441 (comment).

Using the model names from https://github.com/unicode-org/lstm_word_segmentation, the paths would be

segmenter/lstm@1/th/model4
segmenter/lstm@1/th/model5
segmenter/lstm@1/th/model7

At runtime, we would try to load models in order, starting with the most accurate down to the least accurate.

sffc · 2023-01-25T00:25:10Z

CC @robertbastian

sffc · 2023-01-25T02:45:15Z

For distinguishing between _lstm+_dictionary and _auto: a fairly clean way to do this is to make the keys look something like

segmenter/lstm@1/... for try_new_lstm_unstable
segmenter/dictionary@1/... for try_new_dictionary_unstable
segmenter/auto/lstm@1/... and segmenter/auto/dictionary@1/... for try_new_auto_unstable

and rely on data deduplication. This works in databake and postcard, but not in fs, and requires a little bit of extra logic for it to work in a data cache / web request. The principal advantage is that it entirely eliminates the requirement that we add the extra flag to datagen, because we can determine based on the keys being used how we want to build the data.

@Manishearth suggested that datagen can print a warning if both auto and non-auto keys are present at the same time to alert clients that they may be doing something wrong.

robertbastian · 2023-01-25T12:29:28Z

I don't think we should rely on cross-key data deduplication. While BlobDataProvider supports this, many adapters will not, and even for databake it's only best effort (it won't happen across codegen units).

sffc · 2023-01-25T15:29:59Z

Acknowledged. On the other hand, the cross-key data deduplication is an optimization; worst case you carry more data than you need, and people who care about data size can be careful to not combine auto and non-auto keys (Manish's proposed warning can help).

Another option in the same vein is that the non-auto constructors could add an extra key like segmenter/lstm/full@1 that doesn't hold any data but informs datagen to include full lstm data.

Manishearth · 2023-01-25T15:59:51Z

I don't think we should rely on cross-key data deduplication. While BlobDataProvider supports this, many adapters will not, and even for databake it's only best effort (it won't happen across codegen units).

So this is also my position in general, except here it's going to be pretty rare that you have both auto and non-auto keys in use.

The additional key was definitely something I was considering too, though.

sffc · 2023-02-22T03:55:05Z

Discussion with @robertbastian @Manishearth :

We can give hints to datagen by adding additional strings to the executable inside of the constructors without needing additional bounds on those constructors.

In the base case, the hints can look like key strings so that they get picked up by the default keyextract infrastructure. However, we prefer a solution where the hints are separate from key strings; keyextract produces a structured key file, and the datagen API needs to be able to accept it.

sffc · 2023-03-08T01:24:39Z

@robertbastian Are you planning to add the new config options to the datagen API? I think that blocks this issue.

sffc · 2023-03-23T16:40:24Z

It occurred to me that we missed out on the most simple and clean solution for this problem. We use the separate keys as suggested in #2905 (comment), but we can do it without any data deduplication with the following change:

try_new_auto requires segmenter/lstm/auto@1 and segmenter/lstm/dictionary@1
try_new_dictionary requires segmenter/dictionary/ext@1 and segmenter/dictionary/auto@1
try_new_lstm requires segmenter/lstm/ext@1 and segmenter/lstm/auto@1

And then the data blobs live in the keys as follows

segmenter/lstm/auto contains th, km, lo, my
segmenter/lstm/ext is empty for now
segmenter/dictionary/auto contains jp
segmenter/dictionary/ext contains th, km, lo, my

sffc · 2023-03-23T17:18:02Z

Discussion: move forward with the above. Don't make the empty lstm key until we need it.

sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 20, 2022

sffc added this to the ICU4X 1.2 milestone Dec 20, 2022

sffc self-assigned this Dec 20, 2022

sffc added the C-segmentation Component: Segmentation label Dec 20, 2022

sffc mentioned this issue Dec 20, 2022

Move Segmenter to Components #2259

Closed

22 tasks

sffc added the discuss-priority Discuss at the next ICU4X meeting label Dec 21, 2022

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jan 5, 2023

sffc mentioned this issue Jan 25, 2023

How to handle non-orthogonal data #3022

Open

sffc added the discuss-priority Discuss at the next ICU4X meeting label Jan 25, 2023

sffc mentioned this issue Jan 25, 2023

Provide data for Bidi pairing of brackets #3030

Closed

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Mar 23, 2023

sffc mentioned this issue Apr 5, 2023

Apply new structure to public Dictionary/LSTM data markers #3267

Merged

sffc closed this as completed in #3267 Apr 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support datagen with dictionary for some locales and LSTM for others #2905

Support datagen with dictionary for some locales and LSTM for others #2905

sffc commented Dec 20, 2022 •

edited

Loading

sffc commented Jan 5, 2023 •

edited

Loading

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023 •

edited

Loading

robertbastian commented Jan 25, 2023

sffc commented Jan 25, 2023

Manishearth commented Jan 25, 2023

sffc commented Feb 22, 2023

sffc commented Mar 8, 2023

sffc commented Mar 23, 2023 •

edited

Loading

sffc commented Mar 23, 2023 •

edited

Loading

Support datagen with dictionary for some locales and LSTM for others #2905

Support datagen with dictionary for some locales and LSTM for others #2905

Comments

sffc commented Dec 20, 2022 • edited Loading

sffc commented Jan 5, 2023 • edited Loading

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023

sffc commented Jan 25, 2023 • edited Loading

robertbastian commented Jan 25, 2023

sffc commented Jan 25, 2023

Manishearth commented Jan 25, 2023

sffc commented Feb 22, 2023

sffc commented Mar 8, 2023

sffc commented Mar 23, 2023 • edited Loading

sffc commented Mar 23, 2023 • edited Loading

sffc commented Dec 20, 2022 •

edited

Loading

sffc commented Jan 5, 2023 •

edited

Loading

sffc commented Jan 25, 2023 •

edited

Loading

sffc commented Mar 23, 2023 •

edited

Loading

sffc commented Mar 23, 2023 •

edited

Loading