Support datagen with dictionary for some locales and LSTM for others #2905
Comments
Discussion:
Another snag I remembered: there are multiple versions of the LSTM model that trade size for accuracy.
I think we should default to the "medium accuracy" model in datagen. In order to retain the uniqueness invariant of resource paths, this can be an "auxiliary key", as proposed for currency in #1441 (comment). Using the model names from https://github.com/unicode-org/lstm_word_segmentation, the paths would be
At runtime, we would try to load models in order, starting with the most accurate down to the least accurate.
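The load order described above could be sketched roughly as follows. This is a standalone illustration, not ICU4X code: `Accuracy`, `ModelStore`, and `load_best` are hypothetical names, the tiers stand in for the actual model names from the lstm_word_segmentation repository, and a plain `HashMap` stands in for a data provider.

```rust
use std::collections::HashMap;

/// Hypothetical accuracy tiers (assumption: three tiers; the real model
/// names live in the lstm_word_segmentation repository).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Accuracy {
    High,
    Medium,
    Low,
}

/// Tiers in the order we try them: most accurate first.
const LOAD_ORDER: [Accuracy; 3] = [Accuracy::High, Accuracy::Medium, Accuracy::Low];

/// Stand-in for a data provider: maps (locale, tier) to a model blob.
pub struct ModelStore {
    blobs: HashMap<(String, Accuracy), Vec<u8>>,
}

impl ModelStore {
    pub fn new() -> Self {
        ModelStore { blobs: HashMap::new() }
    }

    /// Register a model blob for a locale at a given tier.
    pub fn insert(&mut self, locale: &str, tier: Accuracy, blob: Vec<u8>) {
        self.blobs.insert((locale.to_string(), tier), blob);
    }

    /// Return the most accurate model present for `locale`, if any.
    pub fn load_best(&self, locale: &str) -> Option<(Accuracy, &[u8])> {
        LOAD_ORDER.iter().copied().find_map(|tier| {
            self.blobs
                .get(&(locale.to_string(), tier))
                .map(|blob| (tier, blob.as_slice()))
        })
    }
}
```

If datagen only exported the medium-accuracy model, `load_best` would miss on the first tier and return the medium blob on the second; a locale with no model at any tier yields `None`.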
For distinguishing between
and rely on data deduplication. This works in databake and postcard, but not in fs, and requires a little bit of extra logic for it to work in a data cache / web request. The principal advantage is that it entirely eliminates the requirement that we add the extra flag to datagen, because we can determine based on the keys being used how we want to build the data. @Manishearth suggested that datagen can print a warning if both
I don't think we should rely on cross-key data deduplication. While
Acknowledged. On the other hand, the cross-key data deduplication is an optimization; worst case you carry more data than you need, and people who care about data size can be careful to not combine auto and non-auto keys (Manish's proposed warning can help). Another option in the same vein is that the non-auto constructors could add an extra key like
So this is also my position in general, except here it's going to be pretty rare that you have both auto and non-auto keys in use. The additional key was definitely something I was considering too, though.
Discussion with @robertbastian and @Manishearth: We can give hints to datagen by adding additional strings to the executable inside of the constructors without needing additional bounds on those constructors. In the base case, the hints can look like key strings so that they get picked up by the default keyextract infrastructure. However, we prefer a solution where the hints are separate from key strings; keyextract produces a structured key file, and the datagen API needs to be able to accept it.
@robertbastian Are you planning to add the new config options to the datagen API? I think that blocks this issue.
It occurred to me that we missed the simplest and cleanest solution to this problem. We use the separate keys as suggested in #2905 (comment), but we can do it without any data deduplication with the following change:
And then the data blobs live in the keys as follows
Discussion: move forward with the above. Don't make the empty lstm key until we need it.
Currently we support km, lo, my, and th for LSTM, and those four plus a single unified CJK for Dictionary. Until we have an ML model working for CJK, users may wish to use LSTM for the four SEA languages and Dictionary for CJK. However, datagen does not currently have the ability to perform this type of filtering in a single go.
A few ways to support this:
Also, datagen doesn't know whether users have `try_new_lstm` + `try_new_dictionary` or `try_new_automatic` at their call sites. I lean toward making this automatic segmenter filtering in datagen the default behavior (drop dictionaries when LSTM is available), but that may cause unexpected behavior when `try_new_dictionary` is directly invoked.
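To make the proposed default concrete, here is a minimal sketch of the "drop dictionaries when LSTM is available" rule. It is not the datagen API: `ModelKind`, `plan_export`, and the `auto_filter` flag are hypothetical, and `"cjk"` is only a placeholder label for the unified CJK dictionary. The model lists follow the support described above (km, lo, my, th for LSTM; those four plus CJK for Dictionary).

```rust
/// Which kind of segmentation model gets exported.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelKind {
    Lstm,
    Dictionary,
}

/// Languages with an LSTM model, per the issue description.
const LSTM_MODELS: [&str; 4] = ["km", "lo", "my", "th"];

/// Languages with a dictionary; "cjk" is a placeholder label (assumption).
const DICTIONARY_MODELS: [&str; 5] = ["km", "lo", "my", "th", "cjk"];

/// Build the list of models to export.
///
/// With `auto_filter` set, a dictionary is skipped whenever an LSTM model
/// exists for the same language. Without it, everything is exported.
pub fn plan_export(auto_filter: bool) -> Vec<(&'static str, ModelKind)> {
    let mut plan = Vec::new();
    for id in LSTM_MODELS.iter().copied() {
        plan.push((id, ModelKind::Lstm));
    }
    for id in DICTIONARY_MODELS.iter().copied() {
        let shadowed_by_lstm = LSTM_MODELS.contains(&id);
        if !(auto_filter && shadowed_by_lstm) {
            plan.push((id, ModelKind::Dictionary));
        }
    }
    plan
}
```

With filtering on, the plan holds the four LSTM models plus the CJK dictionary. That is also where the concern above shows up: a call site that invokes `try_new_dictionary` directly for Thai would find no Thai dictionary in the exported data.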