Vocab.load_vectors_from_bin_loc does not import vectors #856

Closed
raphael0202 opened this issue Feb 22, 2017 · 8 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@raphael0202
Contributor

I've trained a word2vec model for French using gensim, and I'm trying to integrate it into spaCy. Loading the text vector file with Vocab.load_vectors works. However, after dumping the vectors (with Vocab.dump_vectors) and loading them back (with Vocab.load_vectors_from_bin_loc), the token vectors are all zeros.

This script reproduces the bug:

# coding: utf-8

import spacy
from spacy.vocab import Vocab
from pathlib import Path

print(spacy.__file__)
print(spacy.__version__)

WORD2VEC_PATH = 'model.original_format'
WORD2VEC_DUMP_PATH = 'word2vec_spacy.fr.bin'

vocab = Vocab()

with open(WORD2VEC_PATH, 'r') as f:
    f.readline()
    print("Loading word embedding...")
    vocab.load_vectors(f)
    print("Loaded.")

print("Dumping vectors...")
vocab.dump_vectors(Path(WORD2VEC_DUMP_PATH))
print("Dumped.")
print(vocab['test'].vector)


vocab = Vocab()
vec_length = vocab.load_vectors_from_bin_loc(WORD2VEC_DUMP_PATH)
print("Vector length: {}".format(vec_length))
print(vocab['test'].vector)

Output:

/home/raphael/git_core/spaCy/spacy/__init__.py
1.6.0
Loading word embedding...
Loaded.
Dumping vectors...
Dumped.
[-0.019229    0.76435298  0.214569   -1.02372396  0.36121199  1.04253995
  2.00697589 -1.66889203 -0.42917499 -0.71605599  0.44025901  0.084247
  0.045775   -0.41794801 -0.74822599  0.171547   -0.028574    0.29178101
  0.95931101  0.23996601 -0.65064597  0.52666497  2.03168893 -0.70798498
  1.63285196  1.87963498 -0.29646099  1.95624697 -0.33722001 -0.804443
 -0.113575   -0.69362998 -0.212704    0.32264301 -0.810615   -0.32760301
  0.60244399 -2.32113695 -0.66723001  0.690925   -0.72073901 -0.107059
  1.09722304  1.48250699  0.26496401  0.048881   -0.82590598 -2.11811996
 -0.58993697 -1.27138698 -0.202759    1.93008304  0.62591898  1.23023999
  1.26697195  0.12528899 -0.503411    0.88590503  0.238004   -0.33974499
 -1.19306397 -0.67027998 -0.12648501 -0.29229599 -1.35156095 -1.91702402
 -0.76183099 -0.69461501 -0.68717301  0.79888999  1.349015    1.17856503
 -0.61917198  0.342765    0.89328098  1.07901001  0.65021402  0.315745
  0.74039102 -0.30671701 -0.519813    0.71764398  1.384215   -0.91375899
  1.48125696 -0.38761401  1.61084902 -2.48377204  3.03343391  0.069791
 -0.584732    1.47452796 -0.48844001  0.114689   -0.19256601 -0.249009
  1.08574903 -0.44485301 -1.47400606  1.17879605  0.056112    1.54470396
  0.51611    -1.69500101 -0.12818301  0.769252   -0.44041401 -0.35097399
  0.29551199 -0.92208099  0.688618    0.020455    0.104523   -0.707919
  1.80952597 -0.76321697 -0.261572   -0.78649199 -0.61826903  0.110683
  1.05416799  0.78717703  0.64711499  0.65555     1.03850198  0.50172001
  1.05923605 -0.019866   -2.62139797  1.036906   -0.49448699 -1.24969804
  0.123707    0.61285198  0.054703   -1.78264201 -1.19084001  0.57149899
 -0.20975199  0.755593    2.3780551   1.43021798 -0.447779   -0.090097
 -2.35513401 -0.099149    0.766424    1.62619698 -1.39099705 -0.48780099
 -1.87614405  0.35173699  0.446156    0.69945401 -1.16716397  2.65128708
  1.08807504 -1.22461295  0.47241199  0.418331   -0.24393199 -2.91613698
  2.00750804  0.73984498  0.181106    1.11167204  0.227009   -0.53869498
  1.59489596  2.35484791 -1.19441998 -0.183248   -2.18519902  0.444435
  0.016354    0.29998299  0.595272   -0.82985598 -1.14205098 -3.08867097
 -0.92190498  1.31091797 -0.42451501 -0.187921   -0.80356598 -0.055674
 -3.60187006  0.057543    0.93247098 -0.75598198  0.69832897  0.850236
 -1.56917202 -1.67239702  0.89910799 -0.031163    0.339849   -1.73267198
  0.466243    1.47330201 -2.05487394 -2.28297496  1.92062795  2.1347239
  0.711914   -2.90191388 -2.36000109  2.24561501 -1.70335698 -1.09882402
 -0.39223999 -0.464093   -0.296042    0.503896   -0.066985   -1.11051202
  0.34009299 -0.71949601  3.22786999 -1.64138198 -1.29591    -2.36355901
  1.63270104  1.06021798  0.35742    -0.350721   -3.10489297  0.059934
  0.526438    0.45713899 -1.60679197 -0.990493   -0.57441998 -1.49675202
 -1.01904297 -0.167955    1.16826797 -0.384839   -1.73141205 -0.15403201
 -1.18145394 -0.85733902  0.50105298 -1.61830103  0.432576   -1.555511
 -0.89789498  1.84602296  0.27828401  0.63889402 -0.404358   -0.22592901
  0.284996   -1.19782197  0.406023   -1.12461698  0.95940298  1.96275496
 -1.28945196 -1.83372903  0.93819398 -0.034911    0.130376   -0.17457999
  0.138551   -1.006163   -1.436854    2.93343902  0.66847003  0.036765
 -0.51748103 -0.96781403  0.61575598  0.59564602  0.078612   -0.58978701
  2.13165808  3.09175301  0.28941301  0.59303302  1.489182   -0.711595
  1.56517601 -1.11776698  1.37791097 -0.67366099 -0.94920802 -1.67869794
  0.67633897  2.02361202 -0.124076   -1.42339802 -0.729366   -0.73556799
 -0.016703   -0.17059     0.452564   -2.02800894  1.23816299 -0.75858003]
Vector length: 300
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

The word2vec file is quite large, but I can send it if needed.

Your Environment

  • Operating System: Linux Ubuntu 16.04
  • Python Version Used: 3.5
  • spaCy Version Used: HEAD
@honnibal
Member

honnibal commented Feb 22, 2017

Hey,

Looking at the code here, I think this function is assuming the vocabulary is loaded already, and we're just adding the vectors. This means it's failing to create vocabulary entries. Since you're starting with an empty vocab, you're not adding any vectors :(. I think this has happened because I've made several edits to this method to try to improve the load time.

Long story short: if you first load the vocab, it should work as expected. If all you have is the list of strings, you could do:

for string in strings:
    _ = vocab[string]

This creates an entry in the vocab for each string.
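
If the strings are only available inside the word2vec text file itself, they can be pulled out before loading the binary vectors. The following is a minimal sketch, not part of spaCy: `iter_words` is a hypothetical helper, and it assumes the usual word2vec text layout (a header line with the word count and vector dimension, then one word per line followed by its values, separated by whitespace).

```python
import io


def iter_words(lines, skip_header=True):
    """Yield the first token (the word) of each line in a word2vec text file."""
    lines = iter(lines)
    if skip_header:
        # Assumed header line: "<number of words> <vector dimension>"
        next(lines, None)
    for line in lines:
        parts = line.split(None, 1)
        if parts:  # ignore blank lines
            yield parts[0]


# Tiny in-memory stand-in for the real file (made-up words and values).
sample = io.StringIO("2 3\nchat 0.1 0.2 0.3\nchien 0.4 0.5 0.6\n")
words = list(iter_words(sample))
print(words)  # ['chat', 'chien']
```

With a real vocab, the loop over `strings` shown above would run over `iter_words(f)` instead, before calling Vocab.load_vectors_from_bin_loc.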

I think the function should probably be changed so that words which have a vector listed get added to the vocabulary. It seems very unlikely that the current behaviour is what a user would want.
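
To make the difference between the current and the proposed behaviour concrete, here is a toy model using plain dicts. It only illustrates the logic described above and is not spaCy's actual implementation; `load_vectors` and its `add_missing` flag are hypothetical names.

```python
def load_vectors(vocab, vectors, add_missing=False):
    """Copy vectors into vocab and return how many were stored.

    vocab:   dict mapping word -> vector (None until a vector is set)
    vectors: dict mapping word -> list of floats
    """
    stored = 0
    for word, vector in vectors.items():
        if word not in vocab and not add_missing:
            # Behaviour described above: words without an entry are skipped.
            continue
        vocab[word] = list(vector)
        stored += 1
    return stored


vectors = {"test": [0.1, 0.2], "autre": [0.3, 0.4]}

empty_vocab = {}
print(load_vectors(empty_vocab, vectors))  # 0

known_vocab = {"test": None}
print(load_vectors(known_vocab, vectors))  # 1

fixed_vocab = {}
print(load_vectors(fixed_vocab, vectors, add_missing=True))  # 2
```

Starting from an empty vocab, nothing is stored unless missing words are added on the fly, which matches the all-zero vectors in the report.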

@honnibal added the "bug" label (Bugs and behaviour differing from documentation) on Feb 22, 2017
@raphael0202
Contributor Author

@honnibal Hi, OK, I see. So what are the next steps to integrate word embeddings for a new language into spaCy?

@honnibal
Member

honnibal commented Mar 8, 2017

Hm, did I answer this on Gitter already, or are you waiting on an answer still? Sorry, losing track a little bit!

@raphael0202
Contributor Author

I posted this message before asking on SpaCy, so it's good for me ;). By the way, I've been a bit short on time lately to work on spaCy, but integrating the French word2vec model is still on my to-do list :)

@ines
Member

ines commented Mar 18, 2017

Fixed on master, so closing this issue!

@ines closed this as completed on Mar 18, 2017
@raphael0202
Contributor Author

The issue is still present on master. It works if we load both the StringStore (as indicated by @honnibal) and the lexemes (with the Vocab.load_lexemes method).

@ines
Member

ines commented May 7, 2017

Closing this and making #1046 the master issue. Work in progress for spaCy v2.0!

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock bot locked this thread as resolved and limited conversation to collaborators on May 8, 2018