Load full native fastText Facebook model is partial #2969

Open
aviclu opened this issue Sep 30, 2020 · 16 comments
Assignees
Labels
bug Issue described a bug impact MEDIUM Big annoyance for affected users reach MEDIUM Affects a significant number of users

Comments

@aviclu

aviclu commented Sep 30, 2020

Problem description

The hidden-layer vectors do not load correctly. I'm using the gensim.models.fasttext.load_facebook_model function to load the .bin file, but syn1 fails to load (accessing it raises the error below), and trainables.syn1neg is full of zeros.

'FastTextTrainables' object has no attribute 'syn1'

Steps/code/corpus to reproduce

Load Facebook's model with ft = gensim.models.fasttext.load_facebook_model(fname).
Then access ft.syn1 (which fails) or ft.trainables.syn1neg (which returns an all-zero array).

Versions

Please provide the output of:
Windows-2012ServerR2-6.3.9600-SP0
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
Bits 64
NumPy 1.18.3
SciPy 1.4.1
gensim 3.8.3
FAST_VERSION 0

@gojomo
Collaborator

gojomo commented Sep 30, 2020

Are you using a particular public model, and if so, which one? Alternatively, if using a private model, with what parameters was it trained?

@aviclu
Author

aviclu commented Sep 30, 2020

@gojomo I'm using the official crawl-300d-2M-subword.bin file, which I downloaded from https://fasttext.cc/docs/en/english-vectors.html

@gojomo
Collaborator

gojomo commented Sep 30, 2020

Thanks. There would only be one of either syn1 or syn1neg - but whichever loads, should have non-zero values.

Was anything anomalous displayed during load, especially if setting global logging level to DEBUG?
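
To make the syn1 / syn1neg distinction concrete, here is a minimal sketch of a helper that reports which hidden-layer matrix a loaded model carries and whether it holds any non-zero value. gensim 3.8.x, as in the report above, keeps the array on model.trainables; also looking on the model object itself is an assumption meant to cover other versions. The name find_hidden_layer is hypothetical and not part of gensim, and the objects at the bottom are stand-ins, not real models.

```python
from types import SimpleNamespace


def find_hidden_layer(model):
    """Return (name, has_nonzero) for the first hidden-layer matrix found.

    Hypothetical helper, not part of gensim. Looks for syn1 (hierarchical
    softmax) and syn1neg (negative sampling) on the model and on
    model.trainables, since the location is assumed to differ by version.
    """
    holders = [model, getattr(model, "trainables", None)]
    for name in ("syn1", "syn1neg"):
        for holder in holders:
            matrix = getattr(holder, name, None)
            if matrix is not None:
                # Plain Python check; for a NumPy array, matrix.any() is much faster.
                has_nonzero = any(value != 0 for row in matrix for value in row)
                return name, has_nonzero
    return None, False


# Stand-in objects (not real gensim models) to show the behavior.
healthy = SimpleNamespace(trainables=SimpleNamespace(syn1neg=[[0.1, -0.2], [0.0, 0.3]]))
zeroed = SimpleNamespace(trainables=SimpleNamespace(syn1neg=[[0.0, 0.0], [0.0, 0.0]]))
print(find_hidden_layer(healthy))
print(find_hidden_layer(zeroed))
```

A result of (name, False) on a real model would correspond to the all-zeros symptom reported here.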

@gojomo
Collaborator

gojomo commented Oct 1, 2020

I've confirmed that even in our pre-4.0.0 develop branch, which has a lot of FastText fixes (but nothing specifically touching load_facebook_model), loading that model results in an all-zeros syn1neg.

It looks like @mpenkov added the load_facebook_model entry point in #2376, but it wholly depends on the earlier _load_fasttext_format function also by @menshikh-iv.

Are we sure this ever worked? Is there a chance the file itself has zeros? (Trying load_facebook_model(datapath('lee_fasttext_new.bin')), a toy-sized testing file checked in unit tests, does show non-zeros in the model's syn1neg.)

@mpenkov mpenkov added the bug Issue described a bug label Oct 1, 2020
@mpenkov mpenkov changed the title [BUG] Load full native fastText Facebook model is partial Load full native fastText Facebook model is partial Oct 1, 2020
@piskvorky piskvorky added impact MEDIUM Big annoyance for affected users reach MEDIUM Affects a significant number of users labels Oct 1, 2020
@gojomo
Collaborator

gojomo commented Dec 8, 2020

From a quick scan of tests in test_fasttext.py, I don't see anything that does a meaningful test of the results of load_facebook_model() other than just the loaded vectors. (That is: nothing to test that which makes load_facebook_model different from load_facebook_vectors.)

There is one attempted roundtrip test, if the native FT_HOME directory is available, in SaveFacebookByteIdentityTest. But that directory isn't usually available, so I'm unsure if/ever this was working.

It's likely load_facebook_model doesn't work at all for its intended purpose.

@piskvorky
Owner

piskvorky commented Dec 9, 2020

Marking this as blocking for 4.0.0 – CC @mpenkov can you check?

@piskvorky piskvorky added this to the 4.0.0 milestone Dec 9, 2020
@mpenkov mpenkov self-assigned this Dec 18, 2020
@mpenkov
Collaborator

mpenkov commented Feb 26, 2021

  • Reproduce problem with toy dataset (matrix is not all zero, as reported by @gojomo)
  • Reproduce problem with real dataset (matrix is all zero, as reported by @aviclu and @gojomo)
  • Run round-trip test that was relying on FT_HOME (test passes)
  • Anything else?

@mpenkov
Collaborator

mpenkov commented Feb 26, 2021

import gensim.models.fasttext
import gensim.test.utils
path = gensim.test.utils.datapath('lee_fasttext_new.bin')
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)

Gives:

array([[ 0.27832156,  0.15093271, -0.05810147, ...,  0.20399494,
         0.10794587, -0.17611295],
       [ 0.04015477,  0.2320431 , -0.31041363, ...,  0.07040029,
         0.17735204, -0.23731148],
       [ 0.33127972, -0.08667868, -0.1704444 , ...,  0.20603168,
         0.11391634, -0.15840392],
       ...,
       [ 0.17141579,  0.02448652, -0.14411658, ..., -0.07036947,
         0.4076898 , -0.33286095],
       [ 0.09963796,  0.09554827, -0.1726573 , ..., -0.11196624,
         0.25655633, -0.24722196],
       [ 0.16295125, -0.02737397, -0.12545614, ..., -0.00165336,
         0.31274942, -0.20620131]], dtype=float32)

@gojomo
Collaborator

gojomo commented Feb 26, 2021

Indeed, that load (from a tiny file in the test directory of unclear vintage) gives a syn1neg value that looks correct, as noted in my comment of 2020-10-01.

The report is of zeros when loading a large full model from Facebook - specifically crawl-300d-2M-subword.bin.

@mpenkov
Collaborator

mpenkov commented Feb 26, 2021

Yeah, I had to leave it loading overnight. And yes, I get the same results as you. So now we're on the same page.

# repr_real.py
import sys
import gensim.models.fasttext

path = sys.argv[1]  # path to the .bin file
model = gensim.models.fasttext.load_facebook_model(path)
print(model.syn1neg)

$ time python repr_real.py ~/Downloads/crawl-300d-2M-subword/crawl-300d-2M-subword.bin
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

real    30m43.069s
user    3m28.906s
sys     3m2.842s

@mpenkov
Collaborator

mpenkov commented Feb 26, 2021

I had a closer look at that file (crawl-300d-2M-subword.bin). At the end of the file, where we expect the hidden layer to be, there's a bunch of zeros.

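# repr_readmatrix.py
import collections
import gensim.models._fasttext_bin

path = '/Users/misha/Downloads/crawl-300d-2M-subword/crawl-300d-2M-subword.bin'
seek_pos = 4835845135  # offset of the hidden-layer matrix, obtained via pdb
with open(path, 'rb') as fin:
    # Read the raw bytes from the matrix offset to the end of the file.
    fin.seek(seek_pos)
    matrix_bytes = fin.read()
    # Rewind and parse the same region with gensim's own matrix loader.
    fin.seek(seek_pos)
    matrix = gensim.models._fasttext_bin._load_matrix(fin, new_format=True)

print(matrix)

# Tally how often each byte value occurs in the raw region.
counter = collections.Counter()
counter.update(matrix_bytes)
print(counter)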

I got the seek position by inserting a breakpoint into the loading code here.

$ time python repr_readmatrix.py 
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Counter({0: 2400000012, 128: 1, 132: 1, 30: 1, 44: 1, 1: 1})

real    2m59.471s
user    2m43.491s
sys     0m7.710s
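
The script above reads the rest of the file, roughly 2.4 GB, into memory before counting. A minimal sketch of a streaming alternative that tallies byte values chunk by chunk follows; the helper name count_bytes_from and the chunk size are illustrative choices, not anything from gensim, and the in-memory buffer is a tiny stand-in for the real file.

```python
import collections
import io


def count_bytes_from(fin, seek_pos, chunk_size=1 << 20):
    """Count byte values from seek_pos to end of file without loading it all."""
    counter = collections.Counter()
    fin.seek(seek_pos)
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:
            break
        counter.update(chunk)  # iterating over bytes yields ints in 0..255
    return counter


# Tiny in-memory stand-in for the real file: 2 skipped bytes, 10 zeros, one 1.
demo = io.BytesIO(b"\xff\xff" + b"\x00" * 10 + b"\x01")
print(count_bytes_from(demo, seek_pos=2, chunk_size=4))
```

Against the real file, the same function would be called on a handle opened in 'rb' mode with seek_pos=4835845135.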

Our code correctly interprets that as a (2M x 300) matrix of zeros.
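
The five non-zero bytes in that Counter are consistent with the matrix header rather than with any weights. Here is a sketch of the arithmetic, assuming the header is one boolean byte followed by two little-endian 64-bit integers (rows, then columns) and that the weights are 32-bit floats. That layout is an assumption about the new-format matrix header, not something stated in this thread.

```python
import collections
import struct

rows, cols = 2_000_000, 300

# Assumed header layout: 1 bool byte, then rows and cols as little-endian int64.
header = struct.pack("<?qq", False, rows, cols)
header_counts = collections.Counter(header)
print(len(header), sorted(header_counts.items()))

# If every float32 weight is 0.0, the payload is rows * cols * 4 zero bytes.
payload_zero_bytes = rows * cols * 4
print(header_counts[0] + payload_zero_bytes)
```

Under these assumptions the header contributes 12 zero bytes plus one each of 128, 132, 30, 44 and 1, and the payload contributes 2,400,000,000 zero bytes. The total of 2,400,000,012 zeros matches the Counter output above, which suggests the payload holds no non-zero weights at all.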

I can think of two explanations for this.

  1. Something changed in the model format and we haven't been keeping up. I think we may need to revisit the format of their model files.
  2. Their model is buggy. It's unlikely we (or anybody else) can extract anything other than zeros from a slab of bytes that is 99.99% zero.

@gojomo @piskvorky Which do you think is the more likely explanation? Could there be another?

@gojomo
Collaborator

gojomo commented Feb 27, 2021

That's suspicious, as I'd not expect any large ranges-of-zero-vectors in a truly saved model. Maybe, point out the oddity & ask at the FacebookResearch Fasttext project issues? Devise a differential test that'd work well with a real syn1neg but poorly or not-at-all with an uninitialized layer? (I'm having a hard time thinking of a stark, compact test. Any effect might be most evident in a -supervised mode model - which that file isn't, and perhaps files in that mode might be saving the 'right' things even if this file isn't.)
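
One possible shape for such a differential check, sketched under the assumption that the model was trained with negative sampling: the predicted probability for a (center, context) pair is the sigmoid of the dot product between the center word's input vector and the context word's syn1neg row. With an all-zero syn1neg every pair scores exactly 0.5, whereas a trained layer would be expected to give different scores to different pairs. The vectors below are made-up toy values, and pair_score is a hypothetical helper, not a gensim function.

```python
import math


def pair_score(input_vec, output_vec):
    """Negative-sampling style score: sigmoid of the dot product."""
    dot = sum(a * b for a, b in zip(input_vec, output_vec))
    return 1.0 / (1.0 + math.exp(-dot))


center = [0.5, -1.0, 0.25]      # toy input vector
trained_row = [1.0, -0.5, 2.0]  # toy stand-in for a trained syn1neg row
zero_row = [0.0, 0.0, 0.0]      # what the Facebook file appears to contain

print(pair_score(center, trained_row))  # a value away from 0.5
print(pair_score(center, zero_row))     # exactly 0.5, for any center vector
```

On a real model, scoring a handful of known co-occurring pairs and a handful of random pairs this way would show no difference at all when the layer is zeroed.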

@mpenkov
Collaborator

mpenkov commented Feb 28, 2021

Should we still treat this as a blocker for 4.0.0?

@mpenkov
Collaborator

mpenkov commented Mar 9, 2021

It doesn't look like the FB guys will examine this anytime soon, so I suggest we remove this from the milestone and move on with the release.

@mpenkov
Collaborator

mpenkov commented Mar 15, 2021

@piskvorky Removing this from the milestone as discussed during our last meeting. Please let me know if I've misunderstood.

@mpenkov mpenkov removed this from the 4.0.0 milestone Mar 15, 2021
@piskvorky
Owner

Yes, thanks. If it's really a bug with the FB model, not much we can do about it.


4 participants