AssertionError: unexpected number of vectors when loading Korean FB model #2402
Comments
I think this may have been fixed as part of another issue. Could you please have a look here: #2378 (comment) and let me know if that fixes your problem?
@mpenkov
However, the error still occurs.
The reason you're getting a 404 looking for that branch is that the PR got merged. Please try again with the current code (I think the change to _fasttext_bin.py you described is irrelevant to the problem).
@mpenkov
The error still happened.
OK, thank you. I will try reproducing this locally and get back to you with more info.
OK, I've successfully reproduced the error.
There's an inconsistency in the model.
The model reports that it has 1999989 words and 2000000 buckets, meaning it must have 3999989 vectors (each word and bucket corresponds to a vector). However, the actual collection of vectors (the matrix) contains 4000000 vectors (11 extra vectors). The reason we are missing 11 words is broken Unicode. If we enable logging, we see this:
There are 12 vocab terms that have broken Unicode. We handle that by ignoring any bad characters. Unfortunately, this sets us up for a collision, because all the above terms map to the same string (the empty string, in this case). @piskvorky How should we deal with this case? It seems like the model from FB may be broken - perhaps we should reach out to them for more info? The fact that older models work fine (see the original post) suggests that something fishy is going on.
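For illustration, here is what errors='ignore' does to one such broken byte sequence (the bytes are the same example shown later in this thread); the arithmetic in the comments uses the figures reported above:

broken = b'\xed\xa1\x9c'                                # invalid utf-8 sequence from the model
print(repr(broken.decode('utf-8', errors='ignore')))    # '' -- every bad byte is silently dropped

# 12 broken terms all decode to '' and collapse into a single vocabulary key,
# so the vocabulary loses 11 entries:
# 1999989 (words after the collapse) + 2000000 (buckets) = 3999989 expected rows,
# while the matrix actually stores 4000000 rows -> AssertionError.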
Sure, we can report this to FB. But do they actually have a contract in place that models contain valid utf8-encoded Unicode? IIRC, the original word2vec simply treated strings as binary bytes… (sometimes cutting utf8 multi-byte characters in half when trimming length etc). Inability to decode the strings was treated as "your problem, for us it's just bytes".
@mpenkov
Does this occur because of a shortage of RAM??? Thank you anyway :)
It does indeed look like you've run out of memory. In my tests, loading that model required around 20GB of RAM (this decreases to around 15GB after the model loads and gets cleaned up). If you don't intend to continue training the model, use the load_facebook_vectors method (see the change log on master for details).
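A minimal sketch of that lighter-memory path; load_facebook_vectors is the method referenced above (available on gensim's master branch at the time of this thread - check the change log for your installed version), and the file name is the model discussed in this issue:

from gensim.models.fasttext import load_facebook_vectors

# Loads only the word/ngram vectors (a KeyedVectors-style object), not the full
# trainable model, which needs noticeably less RAM.
wv = load_facebook_vectors('cc.ko.300.bin')
print(wv['굽히'].shape)   # (300,) -- querying works, further training does not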
@mpenkov
We keep vocabulary terms in a dictionary. The previous decoder mapped different byte strings to the same unicode string, causing a collision. This meant the vocabulary contained fewer terms than it should, breaking the model.
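A minimal sketch of that collision, using two made-up broken byte strings (both decode to the empty string under errors='ignore'):

raw_terms = [b'\xed\xa1\x9c', b'\xed\xa0\x80']   # two distinct, broken byte strings
vocab = {}
for index, term in enumerate(raw_terms):
    key = term.decode('utf-8', errors='ignore')  # both keys become ''
    vocab[key] = index

print(len(raw_terms), len(vocab))                # 2 1 -- a term has been lost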
Please show examples, I'll have a look.
Please see the standard library reference for the codecs module.
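For reference, a small illustration of the error-handler machinery the codecs documentation describes, using the broken byte sequence from this thread; the 'hex_escape' handler name is made up for the demo:

import codecs

bad = b'\xed\xa1\x9c'

# Built-in handler: keeps the bad bytes visible instead of silently dropping them.
print(bad.decode('utf-8', errors='backslashreplace'))    # \xed\xa1\x9c

# Custom handlers can also be registered and selected by name.
def hex_escape(exc):
    span = exc.object[exc.start:exc.end]
    return ''.join('<%02X>' % b for b in span), exc.end

codecs.register_error('hex_escape', hex_escape)
print(bad.decode('utf-8', errors='hex_escape'))          # <ED><A1><9C>

Handlers like these keep distinct byte strings distinct after decoding, which avoids the vocabulary collision described above.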
Hi, I may have worked on this part of the codebase a long time ago, so I'm not well aware of recent changes and discussions around it. But would a reasonable solution simply be to use latin1?
From the POV of the loading code, yes, decoding as latin1 will work around the UnicodeDecodeError:

(dbi2) misha@cabron:~/pn/dbi2$ python
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b'\xed\xa1\x9c'
>>> b.decode('latin1')
'í¡\x9c'
>>> b.decode('latin1').encode('latin1')
b'\xed\xa1\x9c'

It may be a better alternative than backslashreplace. However, the code that computes ngram hashes will first encode those strings as utf-8. This means it will not be hashing the original byte values, and the ngram hashes will be incorrect (different to those in the reference implementation).
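To make that hashing concern concrete, a quick REPL check on the same bytes: the latin1 round-trip preserves them, but re-encoding the latin1-decoded string as utf-8 produces different bytes, so any hash computed over them would differ from the reference implementation's.

>>> b = b'\xed\xa1\x9c'
>>> b.decode('latin1').encode('latin1') == b   # round-trip preserves the raw bytes
True
>>> b.decode('latin1').encode('utf-8')         # but utf-8 re-encoding changes them
b'\xc3\xad\xc2\xa1\xc2\x9c'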
@mpenkov

from gensim.models import FastText
fm = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
fm.wv['굽히']

And these are the words that produce the same warnings (there are a lot more): 굽히
Do other Korean words work OK?
@mpenkov
@zzaebok The reference implementation from FB returns an origin vector for the words you specified:
Could you please confirm this behavior? What is special about those words? Are they somehow rare? (Google Translate seems to think they are not.) I think the reason why the vectors are zero is that the model does not have vectors for any ngrams of those words. The words are short (2 characters each), so the number of possible ngrams is 3 (A, B, AB) - if all 3 are absent from the model, you'll get a vector pointing to the origin. We will update gensim to behave the same way.
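For illustration, a hedged sketch of how an out-of-vocabulary vector ends up at the origin when none of its ngrams are present; the function and data below are made up for the demo, not gensim's actual code:

import numpy as np

def oov_vector(ngrams, ngram_vectors, dim=300):
    # Average the vectors of whichever ngrams the model knows about.
    known = [ngram_vectors[ng] for ng in ngrams if ng in ngram_vectors]
    if not known:
        # No known ngrams -> nothing to average -> the zero vector (the origin).
        return np.zeros(dim)
    return np.mean(known, axis=0)

# A 2-character word only has the ngrams A, B and AB (as described above);
# if none of them are in the model, the result is all zeros.
print(oov_vector(['굽', '히', '굽히'], ngram_vectors={}).any())   # False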
@mpenkov
Yes. I'm reopening this ticket for our own records. We will close it when the corresponding branch is merged.
Hi,
I downloaded a pretrained word vector file (.bin) from Facebook
(https://fasttext.cc/docs/en/crawl-vectors.html)
However, when I try to use this model, it raises an error.
The weird thing is that it works fine when I use an older-version .bin file (https://fasttext.cc/docs/en/pretrained-vectors.html).
So I tried to find a solution and found that this problem had happened before and been solved.
The issue was reported in the Facebook fastText tracker
(facebookresearch/fastText#715)
and they fixed it in
(facebookresearch/fastText@e13484b)
So I think gensim's load_fasttext_format function hits this UnicodeDecodeError because of the above problem.
Can you help me find and solve this problem?
I tried changing line 177 of gensim\models\_fasttext_bin.py to

word = word_bytes.decode(encoding, errors='replace')

and to

word = word_bytes.decode(encoding, errors='ignore')

but both produced the same error:
File "C:\Users\User\PycharmProjects\ksenticnet\venv\lib\site-packages\gensim\models\keyedvectors.py", line 2207, in init_post_load
assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
AssertionError: unexpected number of vectors
Versions
Windows-10-10.0.17134-SP0
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 0