AssertionError: unexpected number of vectors when loading Korean FB model #2402
Comments
I think this may have been fixed as part of another issue. Could you please have a look here: #2378 (comment) and let me know if that fixes your problem?
@mpenkov
However, the error still occurs.
The reason you're getting a 404 looking for that branch is that the PR got merged. Please try again with the current code (I think the change to _fasttext_bin.py you described is irrelevant to the problem).
@mpenkov
The error still happened.
OK, thank you. I will try reproducing this locally and get back to you with more info.
OK, I've successfully reproduced the error.
There's an inconsistency in the model.
The model reports that it has 1999989 words and 2000000 buckets, meaning it must have 3999989 vectors (each word and bucket corresponds to a vector). However, the actual collection of vectors (the matrix) contains 4000000 vectors (11 extra vectors). The reason we are missing 11 words is broken Unicode. If we enable logging, we see this:
There are 12 vocab terms that have broken Unicode. We handle that by ignoring any bad characters. Unfortunately, this sets us up for a collision, because all the above terms map to the same string (the empty string, in this case). @piskvorky How should we deal with this case? It seems like the model from FB may be broken - perhaps we should reach out to them for more info? The fact that older models work fine (see the original post) suggests that something fishy is going on.
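For illustration, here is what errors='ignore' does to one such broken byte sequence (the bytes are the same example shown later in this thread); the arithmetic in the comments uses the figures reported above:

broken = b'\xed\xa1\x9c'                                # invalid utf-8 sequence from the model
print(repr(broken.decode('utf-8', errors='ignore')))    # '' -- every bad byte is silently dropped

# 12 broken terms all decode to '' and collapse into a single vocabulary key,
# so the vocabulary loses 11 entries:
# 1999989 (words after the collapse) + 2000000 (buckets) = 3999989 expected rows,
# while the matrix actually stores 4000000 rows -> AssertionError.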
Sure, we can report this to FB. But do they actually have a contract in place that models contain valid utf8-encoded Unicode? IIRC, the original word2vec simply treated strings as binary bytes… (sometimes cutting utf8 multi-byte characters in half when trimming length etc). Inability to decode the strings was treated as "your problem, for us it's just bytes".
@mpenkov
Does this occur because of a shortage of RAM??? Thank you anyway :)
It does indeed look like you've run out of memory. In my tests, loading that model required around 20GB of RAM (this decreases to around 15GB after the model loads and gets cleaned up). If you don't intend to continue training the model, use the load_facebook_vectors method (see the change log on master for details).
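A minimal sketch of that lighter-memory path; load_facebook_vectors is the method referenced above (available on gensim's master branch at the time of this thread - check the change log for your installed version), and the file name is the model discussed in this issue:

from gensim.models.fasttext import load_facebook_vectors

# Loads only the word/ngram vectors (a KeyedVectors-style object), not the full
# trainable model, which needs noticeably less RAM.
wv = load_facebook_vectors('cc.ko.300.bin')
print(wv['굽히'].shape)   # (300,) -- querying works, further training does not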
@mpenkov
We keep vocabulary terms in a dictionary. The previous decoder mapped different byte strings to the same unicode string, causing a collision. This meant the vocabulary contained fewer terms than it should, breaking the model.
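A minimal sketch of that collision, using two made-up broken byte strings (both decode to the empty string under errors='ignore'):

raw_terms = [b'\xed\xa1\x9c', b'\xed\xa0\x80']   # two distinct, broken byte strings
vocab = {}
for index, term in enumerate(raw_terms):
    key = term.decode('utf-8', errors='ignore')  # both keys become ''
    vocab[key] = index

print(len(raw_terms), len(vocab))                # 2 1 -- a term has been lost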
Please show examples, I'll have a look.
Please see the standard library reference for the codecs module.
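For reference, a small illustration of the error-handler machinery the codecs documentation describes, using the broken byte sequence from this thread; the 'hex_escape' handler name is made up for the demo:

import codecs

bad = b'\xed\xa1\x9c'

# Built-in handler: keeps the bad bytes visible instead of silently dropping them.
print(bad.decode('utf-8', errors='backslashreplace'))    # \xed\xa1\x9c

# Custom handlers can also be registered and selected by name.
def hex_escape(exc):
    span = exc.object[exc.start:exc.end]
    return ''.join('<%02X>' % b for b in span), exc.end

codecs.register_error('hex_escape', hex_escape)
print(bad.decode('utf-8', errors='hex_escape'))          # <ED><A1><9C>

Handlers like these keep distinct byte strings distinct after decoding, which avoids the vocabulary collision described above.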
Hi, I may have worked on this part of the codebase a long time ago, so I'm not well aware of recent changes and discussions around it. But would a reasonable solution simply be to use latin1?
From the POV of the loading code, yes, decoding as latin1 will work around the UnicodeDecodeError:

(dbi2) misha@cabron:~/pn/dbi2$ python
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b'\xed\xa1\x9c'
>>> b.decode('latin1')
'í¡\x9c'
>>> b.decode('latin1').encode('latin1')
b'\xed\xa1\x9c'

It may be a better alternative than backslashreplace. However, the code that computes ngram hashes will first encode those strings as utf-8. This means it will not be hashing the original byte values, and the ngram hashes will be incorrect (different to those in the reference implementation).
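To make that hashing concern concrete, a quick REPL check on the same bytes: the latin1 round-trip preserves them, but re-encoding the latin1-decoded string as utf-8 produces different bytes, so any hash computed over them would differ from the reference implementation's.

>>> b = b'\xed\xa1\x9c'
>>> b.decode('latin1').encode('latin1') == b   # round-trip preserves the raw bytes
True
>>> b.decode('latin1').encode('utf-8')         # but utf-8 re-encoding changes them
b'\xc3\xad\xc2\xa1\xc2\x9c'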
@mpenkov

from gensim.models import FastText
fm = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')
fm.wv['굽히']

And these are the words that produce the same warnings (there are a lot more): 굽히
Do other Korean words work OK?
@mpenkov
@zzaebok The reference implementation from FB returns an origin vector for the words you specified:
Could you please confirm this behavior? What is special about those words? Are they somehow rare? (Google Translate seems to think they are not.) I think the reason why the vectors are zero is that the model does not have vectors for any ngrams of those words. The words are short (2 characters each), so the number of possible ngrams is 3 (A, B, AB) - if all 3 are absent from the model, you'll get a vector pointing to the origin. We will update gensim to behave the same way.
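For illustration, a hedged sketch of how an out-of-vocabulary vector ends up at the origin when none of its ngrams are present; the function and data below are made up for the demo, not gensim's actual code:

import numpy as np

def oov_vector(ngrams, ngram_vectors, dim=300):
    # Average the vectors of whichever ngrams the model knows about.
    known = [ngram_vectors[ng] for ng in ngrams if ng in ngram_vectors]
    if not known:
        # No known ngrams -> nothing to average -> the zero vector (the origin).
        return np.zeros(dim)
    return np.mean(known, axis=0)

# A 2-character word only has the ngrams A, B and AB (as described above);
# if none of them are in the model, the result is all zeros.
print(oov_vector(['굽', '히', '굽히'], ngram_vectors={}).any())   # False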
@mpenkov
Yes. I'm reopening this ticket for our own records. We will close it when the corresponding branch is merged.
Hi,
I downloaded a pretrained word vector file (.bin) from Facebook
(https://fasttext.cc/docs/en/crawl-vectors.html)
However, when I try to use this model, it raises an error.
The weird thing is that it works fine when I use an older-version .bin file (https://fasttext.cc/docs/en/pretrained-vectors.html).
So I tried to find a solution and found that this problem had happened before and been solved.
The issue was reported in the Facebook fastText tracker
(facebookresearch/fastText#715)
and they fixed it in
(facebookresearch/fastText@e13484b)
So I think gensim's load_fasttext_format function hits this UnicodeDecodeError because of the above problem.
Can you help me find and solve this problem?
I tried changing line 177 of gensim\models\_fasttext_bin.py to

word = word_bytes.decode(encoding, errors='replace')

and to

word = word_bytes.decode(encoding, errors='ignore')

but both produced the same error:
File "C:\Users\User\PycharmProjects\ksenticnet\venv\lib\site-packages\gensim\models\keyedvectors.py", line 2207, in init_post_load
assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
AssertionError: unexpected number of vectors
Versions
Windows-10-10.0.17134-SP0
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 0