
AssertionError: unexpected number of vectors when loading Korean FB model #2402

Closed
zzaebok opened this issue Mar 6, 2019 · 20 comments
Labels: bug (Issue described a bug)

@zzaebok

zzaebok commented Mar 6, 2019

Hi,

I downloaded a pretrained word vector file (.bin) from Facebook
(https://fasttext.cc/docs/en/crawl-vectors.html).
However, when I try to use this model, it raises an error:

from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The weird thing is that it works fine when I use an old-version bin file (https://fasttext.cc/docs/en/pretrained-vectors.html).

So I tried to find a solution and found that this problem had happened before and been solved.

The issue was reported on the Facebook fastText tracker
(facebookresearch/fastText#715)
and they fixed it in
(facebookresearch/fastText@e13484b)

So I think gensim's load_fasttext_format function hits this UnicodeDecodeError because of the problem above.

Can you help me find and solve this problem?


I tried both
word = word_bytes.decode(encoding, errors='replace')
and
word = word_bytes.decode(encoding, errors='ignore')
in gensim\models\_fasttext_bin.py, line 177, but both produced the same error:
File "C:\Users\User\PycharmProjects\ksenticnet\venv\lib\site-packages\gensim\models\keyedvectors.py", line 2207, in init_post_load
assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
AssertionError: unexpected number of vectors

Versions

Windows-10-10.0.17134-SP0
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 0

@mpenkov
Collaborator

mpenkov commented Mar 7, 2019

I think this may have been fixed as part of another issue.

Could you please have a look here: #2378 (comment) and let me know if that fixes your problem?

@zzaebok
Author

zzaebok commented Mar 7, 2019

@mpenkov
Thanks for your reply.
I tried to open the unicode branch you mentioned in that comment, but I get a 404 Not Found error.
I also changed the _fasttext_bin.py file as 6dc4aef says:

for i in range(vocab_size):
    # read the NUL-terminated word, one byte at a time
    word_bytes = io.BytesIO()
    char_byte = fin.read(1)

    while char_byte != b'\x00':
        word_bytes.write(char_byte)
        char_byte = fin.read(1)
    word_bytes = word_bytes.getvalue()

    # decode strictly: any invalid UTF-8 raises UnicodeDecodeError here
    word = word_bytes.decode(encoding)
    count, _ = _struct_unpack(fin, '@qb')
    raw_vocab[word] = count

However, the error still occurs.

@mpenkov
Collaborator

mpenkov commented Mar 7, 2019

The reason you're getting a 404 looking for that branch is that the PR got merged.

Please try again with the current develop branch and let me know.

(I think the change to _fasttext_bin.py you described is irrelevant to the problem).

@zzaebok
Author

zzaebok commented Mar 7, 2019

@mpenkov
There is no longer a unicode decode error, but now I get:

  Traceback (most recent call last):
  File "C:/Users/사용자/PycharmProjects/Ksenticnet/lab.py", line 16, in <module>
    fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
  File "C:\Users\사용자\PycharmProjects\Ksenticnet\venv\lib\site-packages\gensim\utils.py", line 1447, in new_func1
    return func(*args, **kwargs)
  File "C:\Users\사용자\PycharmProjects\Ksenticnet\venv\lib\site-packages\gensim\models\fasttext.py", line 965, in load_fasttext_format
    return load_facebook_model(model_file, encoding=encoding)
  File "C:\Users\사용자\PycharmProjects\Ksenticnet\venv\lib\site-packages\gensim\models\fasttext.py", line 1232, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "C:\Users\사용자\PycharmProjects\Ksenticnet\venv\lib\site-packages\gensim\models\fasttext.py", line 1344, in _load_fasttext_format
    model.wv.init_post_load(m.vectors_ngrams)
  File "C:\Users\사용자\PycharmProjects\Ksenticnet\venv\lib\site-packages\gensim\models\keyedvectors.py", line 2235, in init_post_load
    assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
AssertionError: unexpected number of vectors


@mpenkov
Collaborator

mpenkov commented Mar 7, 2019

OK, thank you. I will try reproducing this locally and get back to you with more info.

@mpenkov mpenkov self-assigned this Mar 7, 2019
@mpenkov mpenkov added the bug (Issue described a bug) and 3.7.2 labels Mar 7, 2019
@mpenkov mpenkov changed the title 'load_fasttext_format' UnicodeDecodeError bug AssertionError: unexpected number of vectors when loading Korean FB model Mar 9, 2019
@mpenkov
Collaborator

mpenkov commented Mar 9, 2019

OK, I've successfully reproduced the error.

@mpenkov
Collaborator

mpenkov commented Mar 9, 2019

There's an inconsistency in the model.

(Pdb) p vectors
array([[-2.1880191e-02,  2.4097128e-02,  2.4378835e-01, ...,
        -2.3358483e-02, -4.7362827e-02,  4.2944144e-02],
       [ 1.8272826e-02, -1.2648515e-01, -1.3920291e-02, ...,
         3.3092845e-02, -7.3114678e-02, -4.6189796e-02],
       [-8.3837859e-02, -2.9104140e-02,  5.9615869e-02, ...,
        -7.6464862e-03,  5.7216208e-02,  5.0551396e-02],
       ...,
       [-4.7130170e-03, -1.0642191e-02,  2.4006031e-02, ...,
        -4.8146462e-03,  4.6943068e-03,  9.7508440e-03],
       [ 1.2094555e-02, -3.5290685e-03, -6.1084521e-03, ...,
        -5.9561329e-03,  2.0649242e-04,  8.6368741e-03],
       [-2.1612773e-02, -9.9795591e-04,  6.5802750e-03, ...,
         1.5136318e-02, -3.0362314e-02,  3.0855754e-02]], dtype=float32)
(Pdb) p vectors.shape
(4000000, 300)
(Pdb) p vocab_words
1999989
(Pdb) p self.bucket
2000000

The model reports that it has 1999989 words and 2000000 buckets, meaning it must have 3999989 vectors (each word and bucket corresponds to a vector). However, the actual collection of vectors (the matrix) contains 4000000 vectors: 11 more than expected.

The reason we end up 11 vectors over (equivalently, 11 vocab words short) is broken unicode. If we enable logging, we see this:

ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa1\x9c' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xb0\x80' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa7\x80' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xb8\xb0' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa0\x9c' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa6\xac' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xb3\xb5' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xb3\xb4' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa5\xbc' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xb3\xa0' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa0\x84' to word ''
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xed\xa0\x95' to word ''

There are 12 vocab terms with broken Unicode. We handle that by ignoring any bad characters. Unfortunately, this sets us up for a collision, because all 12 of the above terms map to the same string (the empty string, in this case), so the vocabulary ends up 11 entries short.
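
For illustration, here's a minimal sketch of the collision, using only the standard library (the byte strings are taken from the log above):

broken = [b'\xed\xa1\x9c', b'\xed\xb0\x80', b'\xed\xa7\x80']

# Each byte string is an invalid UTF-8 sequence (an encoded surrogate
# half), so errors='ignore' drops every byte and all of them decode
# to the same empty string.
decoded = [b.decode('utf-8', errors='ignore') for b in broken]
print(decoded)      # ['', '', '']

vocab = {word: i for i, word in enumerate(decoded)}
print(len(vocab))   # 1 -- three distinct byte strings collapse into one key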

@piskvorky How should we deal with this case? It seems like the model from FB may be broken - perhaps we should reach out to them for more info? The fact that older models work fine (see original post) suggests that something fishy is going on.

@piskvorky
Owner

piskvorky commented Mar 9, 2019

Sure, we can report this to FB. But do they actually have a contract in place that models contain valid utf8-encoded Unicode?

IIRC, the original word2vec simply treated strings as binary bytes… (sometimes cutting utf8 multi-byte characters in half when trimming length etc). Inability to decode the strings was treated as "your problem, for us it's just bytes".

@zzaebok
Author

zzaebok commented Mar 13, 2019

@mpenkov
I tried your PR and got this error:

Traceback (most recent call last):
  File "data_helper.py", line 133, in <module>
    build_fasttext('cc.ko.300.bin', 'embeddings/context_embeddings.npz', 'embeddings/target_embeddings.npz', word_dict, verb_dict, 300)
  File "data_helper.py", line 24, in build_fasttext
    fasttext_model = FastText.load_fasttext_format(filename, encoding='utf8')
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/utils.py", line 1447, in new_func1
    return func(*args, **kwargs)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 965, in load_fasttext_format
    return load_facebook_model(model_file, encoding=encoding)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 1243, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 1336, in _load_fasttext_format
    max_n=m.maxn,
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 593, in __init__
    self.trainables.prepare_weights(hs, negative, self.wv, update=False, vocabulary=self.vocabulary)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 1123, in prepare_weights
    self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/fasttext.py", line 1143, in init_ngrams_weights
    wv.init_ngrams_weights(self.seed)
  File "/home/jaebok123/.local/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 2187, in init_ngrams_weights
    self.vectors_ngrams = rand_obj.uniform(lo, hi, ngrams_shape).astype(REAL)
MemoryError

Does this occur because of a shortage of RAM?
I have 15GB of memory on Google Cloud Platform, and cc.ko.300.bin is about 8GB.
Could that be the cause?

Thank you anyway :)

@mpenkov
Collaborator

mpenkov commented Mar 13, 2019

It does indeed look like you've run out of memory. In my tests, loading that model required around 20GB of RAM (this decreases to around 15GB after the model loads and gets cleaned up).

If you don't intend to continue training the model, use the load_facebook_vectors function (see the change log on master for details).
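
For example, a minimal sketch (assuming a gensim version where the new load_facebook_vectors function from the change log is available):

from gensim.models.fasttext import load_facebook_vectors

# Loads only the word and ngram vectors, skipping the training state,
# which substantially reduces the memory footprint.
wv = load_facebook_vectors('cc.ko.300.bin')
print(wv['사용자'])  # out-of-vocabulary words are served via ngram hashes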

@zzaebok
Author

zzaebok commented Mar 13, 2019

@mpenkov
I didn't know how to use the load_facebook_vectors method, so I just made a new VM instance with 26GB of memory and finally succeeded in loading the pretrained vectors. Really, thank you.
But can I ask what the change was?
I made an embedding lookup table from the fastText model, but some of the vectors come out as [nan, nan, nan, ..., nan]. What is the role of 'backslashreplace' during decoding?

@mpenkov
Collaborator

mpenkov commented Mar 13, 2019

But can I ask what the change was?

We keep vocabulary terms in a dictionary. The previous decoder mapped different byte strings to the same unicode string, causing a collision. This meant the vocabulary contained fewer terms than it should, breaking the model.

I made an embedding lookup table from the fastText model, but some of the vectors come out as [nan, nan, nan, ..., nan].

Please show some examples and I'll have a look.

What is the role of 'backslashreplace' during decoding?

Please see the standard library reference for the codecs module.
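
For a quick illustration (the byte string is one from the log above):

b = b'\xed\xa1\x9c'

# 'ignore' silently drops the bad bytes, so distinct inputs collide:
print(b.decode('utf-8', errors='ignore'))            # ''

# 'backslashreplace' keeps the bad bytes as visible escape sequences,
# so each broken byte string decodes to a distinct string:
print(b.decode('utf-8', errors='backslashreplace'))  # \xed\xa1\x9c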

@jayantj
Contributor

jayantj commented Mar 16, 2019

Hi, I worked on this part of the codebase a long time ago, so I'm not well aware of recent changes and discussions around it. But would a reasonable solution simply be to use latin1 encoding, or something similar that essentially keeps the byte strings as-is? If I understand correctly, that also bypasses the collision problem of byte strings being mapped to the same unicode string.

@mpenkov
Collaborator

mpenkov commented Mar 18, 2019

From the POV of the loading code, yes, decoding as latin1 will work around the UnicodeDecodeError.

(dbi2) misha@cabron:~/pn/dbi2$ python
Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b'\xed\xa1\x9c'
>>> b.decode('latin1')
'í¡\x9c'
>>> b.decode('latin1').encode('latin1')
b'\xed\xa1\x9c'

It may be a better alternative than backslashreplace.

However, the code that computes ngram hashes first encodes those strings as utf-8. This means it will not be hashing the original byte values, and the ngram hashes will be incorrect (different from those in the reference implementation).
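
A small sketch of that concern, using only the standard library:

b = b'\xed\xa1\x9c'

# latin1 decoding is lossless -- every byte maps to exactly one code
# point -- so the original bytes can always be recovered:
s = b.decode('latin1')
assert s.encode('latin1') == b

# ...but re-encoding the same string as UTF-8 (as the ngram hashing
# code does) produces different bytes than the original:
print(s.encode('utf-8'))  # b'\xc3\xad\xc2\xa1\xc2\x9c', not b'\xed\xa1\x9c'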

@zzaebok
Author

zzaebok commented Mar 18, 2019

@mpenkov
Some vectors of Korean words come back as [nan nan nan ... nan].
For example:

from gensim.models import FastText
fm = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
fm.wv['굽히']

[screenshot: fm.wv['굽히'] printing an array of nan values]

And these are some of the words that produce the same warning (there are a lot more):

굽히
받치
일말
흉계
괄시
저간
궁색

@mpenkov
Collaborator

mpenkov commented Mar 18, 2019

Do other Korean words work OK?

@zzaebok
Author

zzaebok commented Mar 18, 2019

@mpenkov
Among the 8000 words I have, around 50 look like that.
The remaining 7950 words work okay.

@mpenkov
Collaborator

mpenkov commented Mar 23, 2019

@zzaebok The reference implementation from FB returns the origin (all-zero) vector for the words you specified:

mpenkov@hetrad2:~/data/ko$ cat examples.txt 
굽히
받치
일말
흉계
괄시
저간
궁색
mpenkov@hetrad2:~/data/ko$ cat examples.txt | ~/git/fastText/fasttext print-word-vectors cc.ko.300.bin 
굽히 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
받치 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
일말 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
흉계 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
괄시 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
저간 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
궁색 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Could you please confirm this behavior? What is special about those words? Are they somehow rare? (Google Translate seems to think they are not.)

I think the reason the vectors are zero is that the model does not have vectors for any ngrams of those words. The words are short (2 characters each), so the number of possible ngrams is 3 (A, B, AB); if all 3 are absent from the model, you get a vector pointing at the origin.
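
For intuition, here is a rough sketch of fastText-style ngram extraction (the < and > boundary markers follow the fastText paper; the minn/maxn values below are assumptions for illustration, not values read from this model):

def char_ngrams(word, minn, maxn):
    # fastText wraps each word in boundary markers before extracting
    # character ngrams
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('굽히', 3, 6))  # ['<굽히', '굽히>', '<굽히>'] -- only 3 ngrams
print(char_ngrams('굽히', 5, 5))  # [] -- no ngrams at all for a 2-char word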

We will update gensim to behave the same way.

@zzaebok
Author

zzaebok commented Apr 1, 2019

@mpenkov
Sorry for the late reply.
They are quite rare words, as you said. I think FB built fastText on n-grams (n >= 3; I am not sure, because they mention 5-grams for the English version), so I cannot get word vectors for them. I had thought I could resolve the OOV problem with FB fastText, because its old version did solve it.
Anyway, do you mean that a newly updated version of gensim will return vectors pointing at the origin, not 'nan'?
Thank you very much

@zzaebok zzaebok closed this as completed Apr 1, 2019
@mpenkov
Collaborator

mpenkov commented Apr 2, 2019

Anyway, do you mean that a newly updated version of gensim will return vectors pointing at the origin, not 'nan'?

Yes.

I'm reopening this ticket for our own records. We will close it when the corresponding branch is merged.
