Load full native fastText Facebook model is partial #2969
Comments
Are you using a particular public model, and if so, which one? Alternatively, if using a private model, with what parameters was it trained? |
@gojomo I'm using the official |
Thanks. There would only be one of either Was anything anomalous displayed during load, especially if setting global logging level to DEBUG? |
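For anyone reproducing this, a minimal sketch of turning on DEBUG-level logging before the load, so that any messages emitted while reading the file become visible. `enable_debug_logging` is a hypothetical helper name, not part of gensim, and the log format is an arbitrary choice.

```python
import logging


def enable_debug_logging() -> logging.Logger:
    """Set the root logger to DEBUG so library loggers propagate everything."""
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.DEBUG,
    )
    root = logging.getLogger()
    # basicConfig does nothing if handlers are already configured,
    # so set the level explicitly as well.
    root.setLevel(logging.DEBUG)
    return root


enable_debug_logging()

# Then load as in the report (requires gensim and a model file):
# import gensim.models.fasttext
# model = gensim.models.fasttext.load_facebook_model(path)
```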
I've confirmed that even in our pre-4.0.0 It looks like @mpenkov added the Are we sure this ever worked? Is there a chance the file itself has zeros? (Trying |
From a quick scan of tests in There is one attempted roundtrip test, if the native It's likely |
Marking this as blocking for 4.0.0 – CC @mpenkov can you check? |
import gensim.models.fasttext
import gensim.test.utils
path = gensim.test.utils.datapath('lee_fasttext_new.bin')
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)
Gives:
|
Indeed, that load (from a tiny file in the test directory of unclear vintage) gives a The report is of zeros when loading a large full model from Facebook - specifically |
Yeah, I had to leave it loading overnight. And yes, I get the same results as you. So now we're on the same page.
import sys
import gensim.models.fasttext
path = sys.argv[1]
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)
|
I had a closer look at that file (crawl-300d-2M-subword.bin). At the end of the file, where we expect the hidden layer to be, there's a bunch of zeros.
import collections
import io
import gensim.models._fasttext_bin
path = '/Users/misha/Downloads/crawl-300d-2M-subword/crawl-300d-2M-subword.bin'
seek_pos = 4835845135 # obtained via pdb
with open(path, 'rb') as fin:
    fin.seek(seek_pos)
    matrix_bytes = fin.read()
    fin.seek(seek_pos)
    matrix = gensim.models._fasttext_bin._load_matrix(fin, new_format=True)
print(matrix)
counter = collections.Counter()
counter.update(matrix_bytes)
print(counter)
I got the seek position by inserting a breakpoint into the loading code here.
Our code correctly interprets that as a (2M x 300) matrix of zeros. I can think of two explanations for this.
@gojomo @piskvorky Which do you think is the more likely explanation? Could there be another? |
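The byte count above reads the entire tail of the file into memory at once. A minimal sketch of a streaming variant follows, for machines where that is a problem. `count_bytes_from` and `is_all_zero` are hypothetical helper names, and the chunk size is an arbitrary assumption. The counts cover everything from the offset to the end of the file, so bytes belonging to whatever header precedes the matrix data are included too.

```python
import collections


def count_bytes_from(path, offset, chunk_size=1 << 20):
    """Histogram of byte values from `offset` to end of file, read in chunks."""
    counter = collections.Counter()
    with open(path, "rb") as fin:
        fin.seek(offset)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            # Iterating over bytes yields ints in range(256).
            counter.update(chunk)
    return counter


def is_all_zero(counter):
    """True only if at least one byte was counted and every byte was 0x00."""
    return bool(counter) and set(counter) == {0}


# Usage, with the path and offset from the comment above:
# counter = count_bytes_from(path, seek_pos)
# print(counter.most_common(5), is_all_zero(counter))
```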
That's suspicious, as I'd not expect any large ranges-of-zero-vectors in a truly saved model. Maybe, point out the oddity & ask at the FacebookResearch Fasttext project issues? Devise a differential test that'd work well with a real |
Should we still treat this as a blocker for 4.0.0? |
It doesn't look like the FB guys will examine this anytime soon, so I suggest we remove this from the milestone and move on with the release. |
@piskvorky Removing this from the milestone as discussed during our last meeting. Please let me know if I've misunderstood. |
Yes, thanks. If it's really a bug with the FB model, not much we can do about it. |
Problem description
The hidden-layer vectors are not loaded correctly. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but accessing syn1 fails with the error below, and trainables.syn1neg is full of zeros.
'FastTextTrainables' object has no attribute 'syn1'
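When probing a loaded model, the AttributeError can be avoided by looking up both candidate attribute names defensively. A minimal sketch, assuming only that the object may or may not carry `syn1` and `syn1neg`; `present_output_layers` is a hypothetical helper, not part of gensim.

```python
def present_output_layers(obj, names=("syn1", "syn1neg")):
    """Return {name: value} for each of `names` that exists on `obj`."""
    found = {}
    for name in names:
        value = getattr(obj, name, None)  # None instead of AttributeError
        if value is not None:
            found[name] = value
    return found


# Usage with a loaded model (gensim 3.8.x layout, as in this report):
# layers = present_output_layers(ft.trainables)
# print(sorted(layers))
```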
Steps/code/corpus to reproduce
Simply using
ft = gensim.models.fasttext.load_facebook_model(fname)
on Facebook's model. Then
ft.syn1
or
ft.trainables.syn1neg
which returns the zero array.
Versions
Please provide the output of:
Windows-2012ServerR2-6.3.9600-SP0
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Bits 64
NumPy 1.18.3
SciPy 1.4.1
gensim 3.8.3
FAST_VERSION 0