Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support datagen with dictionary for some locales and LSTM for others #2905

Closed
Tracked by #2259
sffc opened this issue Dec 20, 2022 · 12 comments · Fixed by #3267
Closed
Tracked by #2259

Support datagen with dictionary for some locales and LSTM for others #2905

sffc opened this issue Dec 20, 2022 · 12 comments · Fixed by #3267
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Dec 20, 2022

Currently we support km, lo, my, and th for LSTM, and those four plus a single unified CJK for Dictionary. Until we have an ML model working for CJK, users may wish to use LSTM for the four SEA languages and Dictionary for CJK. However, datagen does not currently have the ability to perform this type of filtering in a single go.

A few ways to support this:

  1. Add key-specific locale filtering
  2. Make datagen aware of the segmentation data with a special flag (similar to what we do for collations)
  3. Add more general logic to support locales in one key if available and fallback to another key otherwise, and utilize it for the segmenter

Also, datagen doesn't know whether users have try_new_lstm + try_new_dictionary or try_new_automatic at their call sites. I lean toward making this automatic segmenter filtering in datagen the default behavior (drop dictionaries when LSTM is available), but that may cause unexpected behavior when try_new_dictionary is directly invoked.

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 20, 2022
@sffc sffc added this to the ICU4X 1.2 milestone Dec 20, 2022
@sffc sffc self-assigned this Dec 20, 2022
@sffc sffc added the C-segmentation Component: Segmentation label Dec 20, 2022
@sffc sffc mentioned this issue Dec 20, 2022
22 tasks
@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Dec 21, 2022
@sffc
Copy link
Member Author

sffc commented Jan 5, 2023

Discussion:

  • @sffc - I prefer this being a dedicated datagen flag for segmenter. Behavior: If both LSTM and Dictionary keys are requested, then use a heuristic to include exactly one per language. If LSTM is requested without Dictionary, unconditionally include LSTM for all locales. If Dictionary is requested without LSTM, unconditionally include Dictionary for all locales. There is a flag like --segmenter-include to tweak this behavior.
  • @younies - But this causes problems if one part of the code requires dictionary and another part requires LSTM?
  • @sffc - Yes. I think we can justify that by saying that it's already the case that Dictionary or LSTM may not be defined in all languages.
  • @Manishearth - Note that we're getting more and more configurable in datagen. There is a point where we should think about these holistically.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jan 5, 2023
@sffc
Copy link
Member Author

sffc commented Jan 25, 2023

Another snag I remembered: there are multiple versions of the LSTM model that trade size for accuracy.

@sffc
Copy link
Member Author

sffc commented Jan 25, 2023

I think we should default to the "medium accuracy" model in datagen. In order to retain the uniqueness invariant of resource paths, this can be an "auxiliary key", as proposed for currency in #1441 (comment).

Using the model names from https://github.com/unicode-org/lstm_word_segmentation, the paths would be

  • segmenter/lstm@1/th/model4
  • segmenter/lstm@1/th/model5
  • segmenter/lstm@1/th/model7

At runtime, we would try to load models in order, starting with the most accurate down to the least accurate.

@sffc
Copy link
Member Author

sffc commented Jan 25, 2023

CC @robertbastian

@sffc
Copy link
Member Author

sffc commented Jan 25, 2023

For distinguishing between _lstm+_dictionary and _auto: a fairly clean way to do this is to make the keys look something like

  • segmenter/lstm@1/... for try_new_lstm_unstable
  • segmenter/dictionary@1/... for try_new_dictionary_unstable
  • segmenter/auto/lstm@1/... and segmenter/auto/dictionary@1/... for try_new_auto_unstable

and rely on data deduplication. This works in databake and postcard, but not in fs, and requires a little bit of extra logic for it to work in a data cache / web request. The principal advantage is that it entirely eliminates the requirement that we add the extra flag to datagen, because we can determine based on the keys being used how we want to build the data.

@Manishearth suggested that datagen can print a warning if both auto and non-auto keys are present at the same time to alert clients that they may be doing something wrong.

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Jan 25, 2023
@robertbastian
Copy link
Member

I don't think we should rely on cross-key data deduplication. While BlobDataProvider supports this, many adapters will not, and even for databake it's only best effort (it won't happen across codegen units).

@sffc
Copy link
Member Author

sffc commented Jan 25, 2023

Acknowledged. On the other hand, the cross-key data deduplication is an optimization; worst case you carry more data than you need, and people who care about data size can be careful to not combine auto and non-auto keys (Manish's proposed warning can help).

Another option in the same vein is that the non-auto constructors could add an extra key like segmenter/lstm/full@1 that doesn't hold any data but informs datagen to include full lstm data.

@Manishearth
Copy link
Member

I don't think we should rely on cross-key data deduplication. While BlobDataProvider supports this, many adapters will not, and even for databake it's only best effort (it won't happen across codegen units).

So this is also my position in general, except here it's going to be pretty rare that you have both auto and non-auto keys in use.

The additional key was definitely something I was considering too, though.

@sffc
Copy link
Member Author

sffc commented Feb 22, 2023

Discussion with @robertbastian @Manishearth :

We can give hints to datagen by adding additional strings to the executable inside of the constructors without needing additional bounds on those constructors.

In the base case, the hints can look like key strings so that they get picked up by the default keyextract infrastructure. However, we prefer a solution where the hints are separate from key strings; keyextract produces a structured key file, and the datagen API needs to be able to accept it.

@sffc
Copy link
Member Author

sffc commented Mar 8, 2023

@robertbastian Are you planning to add the new config options to the datagen API? I think that blocks this issue.

@sffc
Copy link
Member Author

sffc commented Mar 23, 2023

It occurred to me that we missed out on the most simple and clean solution for this problem. We use the separate keys as suggested in #2905 (comment), but we can do it without any data deduplication with the following change:

  1. try_new_auto requires segmenter/lstm/auto@1 and segmenter/lstm/dictionary@1
  2. try_new_dictionary requires segmenter/dictionary/ext@1 and segmenter/dictionary/auto@1
  3. try_new_lstm requires segmenter/lstm/ext@1 and segmenter/lstm/auto@1

And then the data blobs live in the keys as follows

  • segmenter/lstm/auto contains th, km, lo, my
  • segmenter/lstm/ext is empty for now
  • segmenter/dictionary/auto contains jp
  • segmenter/dictionary/ext contains th, km, lo, my

@sffc
Copy link
Member Author

sffc commented Mar 23, 2023

Discussion: move forward with the above. Don't make the empty lstm key until we need it.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Mar 23, 2023
@sffc sffc closed this as completed in #3267 Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants