Overflow error after unicode errors when loading a 'large' model built with gensim #2950
You say that the model was created with Gensim; how was it initially saved? If you saved it via Gensim's …
Yes. As training ultimately only adjusts the numbers in the already-allocated arrays, it shouldn't be implicated in any save/load errors triggered by strings/string-encodings/model-extent/etc. (If correctness/usefulness of results was implicated, or it was some error like a crash triggered only during training, that'd be different.) If size is the key factor, then a large synthetic corpus generated by a tiny amount of code (similar to this one in a previous issue) may be sufficient to trigger it. Meanwhile, if the unicode errors have anything to do with the actual text content, they might be triggerable with a toy-sized corpus of a few tokens using the same text. Similarly, with the unicode errors, it'd be interesting to take any suspect corpus and try: (1) train/save in original FB …
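A throwaway synthetic-corpus generator along those lines might look like the sketch below. All names and sizes here are illustrative (this is not the generator from the earlier issue), and the gensim training call shown in the comment is an untested sketch:

```python
import random

def synthetic_corpus(n_sentences=100_000, vocab_size=500_000,
                     sentence_len=20, seed=0):
    """Yield lists of made-up tokens, to grow the vocab to an arbitrary size."""
    rng = random.Random(seed)
    for _ in range(n_sentences):
        yield ["tok%06d" % rng.randrange(vocab_size)
               for _ in range(sentence_len)]

# The corpus could then be trained and saved with gensim, e.g. (untested):
#   from gensim.models import FastText
#   model = FastText(sentences=list(synthetic_corpus()),
#                    vector_size=100, min_count=1)
#   model.save("big_model")
```

Scaling `n_sentences`/`vocab_size` up until the save/load failure appears would confirm (or rule out) size as the trigger without needing the private corpus.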
Problem description
What are you trying to achieve?
I am loading a `fasttext` model built with `gensim`, using `gensim.models.fasttext.load_facebook_model`, so I can use the model.
What is the expected result?
The model loads correctly.
What are you seeing instead?
Overflow error, preceded by unicode parsing errors.
Steps/code/corpus to reproduce
I get an overflow error when I try to load a `fasttext` model which I built with `gensim`. I have tried with version 3.8.3, and then rebuilt and loaded with the head of the code (4.0.0-dev) as of yesterday. It's not reproducible because I cannot share the corpus. Here is the stack trace:
The `count` variable is calculated as `count = num_vectors * dim`. Both of these are astronomical at 10^23; `dim` should be 100, so there must be some unpacking problem here already. The unpacking of the model params before the vocab looks ok. … `fasttext` module, so I have a workaround.
The counts of the erroneous words are also off the scale:
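To illustrate the kind of unpacking problem suspected here, the following is a minimal sketch (not gensim's actual reader; the field names and values are made up) of how reading two sane 32-bit header fields with the wrong width turns them into one astronomical number:

```python
import struct

# Hypothetical header fields, written little-endian as two 32-bit ints
num_vectors, dim = 2_000_000, 100
buf = struct.pack("<ii", num_vectors, dim)

# Correct read: two 32-bit ints come back unchanged
nv, d = struct.unpack("<ii", buf)

# Wrong-width read: the same 8 bytes interpreted as one 64-bit int
# yields 2_000_000 + 100 * 2**32, i.e. roughly 4.3e11
(garbled,) = struct.unpack("<q", buf)
```

If the reader's cursor or field widths drift out of sync with the file layout anywhere before the vocab, every subsequent value (counts, lengths, string sizes) is garbage, which would also explain the downstream unicode parsing errors.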
I saw that there were many changes from `int` to `long long`, both in 3.8.3 and also in 4.0.0-dev, so my hypothesis was that it would be resolved when updating, but I got the same error.
I don't know if this is sufficient information to go on in order to pin it down; please let me know if I can help with more information.
Versions
Please provide the output of: