Vocab.load_vectors_from_bin_loc does not import vectors #856

raphael0202 · 2017-02-22T09:57:04Z

I've trained a Word2vec model for French using gensim, and I'm trying to integrate it into spaCy. Loading the text vector file with Vocab.load_vectors works. However, after dumping the vectors (with Vocab.dump_vectors) and loading them back (with Vocab.load_vectors_from_bin_loc), the token vectors are all zeros.

This script reproduces the bug:

# coding: utf-8

import spacy
from spacy.vocab import Vocab
from pathlib import Path

print(spacy.__file__)
print(spacy.__version__)

WORD2VEC_PATH = 'model.original_format'
WORD2VEC_DUMP_PATH = 'word2vec_spacy.fr.bin'

vocab = Vocab()

with open(WORD2VEC_PATH, 'r') as f:
    # Skip the first line (the word2vec text header) before handing the file to spaCy
    f.readline()
    print("Loading word embedding...")
    vocab.load_vectors(f)
    print("Loaded.")

print("Dumping vectors...")
vocab.dump_vectors(Path(WORD2VEC_DUMP_PATH))
print("Dumped.")
print(vocab['test'].vector)


vocab = Vocab()
vec_length = vocab.load_vectors_from_bin_loc(WORD2VEC_DUMP_PATH)
print("Vector length: {}".format(vec_length))
print(vocab['test'].vector)

Output:

/home/raphael/git_core/spaCy/spacy/__init__.py
1.6.0
Loading word embedding...
Loaded.
Dumping vectors...
Dumped.
[-0.019229    0.76435298  0.214569   -1.02372396  0.36121199  1.04253995
  2.00697589 -1.66889203 -0.42917499 -0.71605599  0.44025901  0.084247
  0.045775   -0.41794801 -0.74822599  0.171547   -0.028574    0.29178101
  0.95931101  0.23996601 -0.65064597  0.52666497  2.03168893 -0.70798498
  1.63285196  1.87963498 -0.29646099  1.95624697 -0.33722001 -0.804443
 -0.113575   -0.69362998 -0.212704    0.32264301 -0.810615   -0.32760301
  0.60244399 -2.32113695 -0.66723001  0.690925   -0.72073901 -0.107059
  1.09722304  1.48250699  0.26496401  0.048881   -0.82590598 -2.11811996
 -0.58993697 -1.27138698 -0.202759    1.93008304  0.62591898  1.23023999
  1.26697195  0.12528899 -0.503411    0.88590503  0.238004   -0.33974499
 -1.19306397 -0.67027998 -0.12648501 -0.29229599 -1.35156095 -1.91702402
 -0.76183099 -0.69461501 -0.68717301  0.79888999  1.349015    1.17856503
 -0.61917198  0.342765    0.89328098  1.07901001  0.65021402  0.315745
  0.74039102 -0.30671701 -0.519813    0.71764398  1.384215   -0.91375899
  1.48125696 -0.38761401  1.61084902 -2.48377204  3.03343391  0.069791
 -0.584732    1.47452796 -0.48844001  0.114689   -0.19256601 -0.249009
  1.08574903 -0.44485301 -1.47400606  1.17879605  0.056112    1.54470396
  0.51611    -1.69500101 -0.12818301  0.769252   -0.44041401 -0.35097399
  0.29551199 -0.92208099  0.688618    0.020455    0.104523   -0.707919
  1.80952597 -0.76321697 -0.261572   -0.78649199 -0.61826903  0.110683
  1.05416799  0.78717703  0.64711499  0.65555     1.03850198  0.50172001
  1.05923605 -0.019866   -2.62139797  1.036906   -0.49448699 -1.24969804
  0.123707    0.61285198  0.054703   -1.78264201 -1.19084001  0.57149899
 -0.20975199  0.755593    2.3780551   1.43021798 -0.447779   -0.090097
 -2.35513401 -0.099149    0.766424    1.62619698 -1.39099705 -0.48780099
 -1.87614405  0.35173699  0.446156    0.69945401 -1.16716397  2.65128708
  1.08807504 -1.22461295  0.47241199  0.418331   -0.24393199 -2.91613698
  2.00750804  0.73984498  0.181106    1.11167204  0.227009   -0.53869498
  1.59489596  2.35484791 -1.19441998 -0.183248   -2.18519902  0.444435
  0.016354    0.29998299  0.595272   -0.82985598 -1.14205098 -3.08867097
 -0.92190498  1.31091797 -0.42451501 -0.187921   -0.80356598 -0.055674
 -3.60187006  0.057543    0.93247098 -0.75598198  0.69832897  0.850236
 -1.56917202 -1.67239702  0.89910799 -0.031163    0.339849   -1.73267198
  0.466243    1.47330201 -2.05487394 -2.28297496  1.92062795  2.1347239
  0.711914   -2.90191388 -2.36000109  2.24561501 -1.70335698 -1.09882402
 -0.39223999 -0.464093   -0.296042    0.503896   -0.066985   -1.11051202
  0.34009299 -0.71949601  3.22786999 -1.64138198 -1.29591    -2.36355901
  1.63270104  1.06021798  0.35742    -0.350721   -3.10489297  0.059934
  0.526438    0.45713899 -1.60679197 -0.990493   -0.57441998 -1.49675202
 -1.01904297 -0.167955    1.16826797 -0.384839   -1.73141205 -0.15403201
 -1.18145394 -0.85733902  0.50105298 -1.61830103  0.432576   -1.555511
 -0.89789498  1.84602296  0.27828401  0.63889402 -0.404358   -0.22592901
  0.284996   -1.19782197  0.406023   -1.12461698  0.95940298  1.96275496
 -1.28945196 -1.83372903  0.93819398 -0.034911    0.130376   -0.17457999
  0.138551   -1.006163   -1.436854    2.93343902  0.66847003  0.036765
 -0.51748103 -0.96781403  0.61575598  0.59564602  0.078612   -0.58978701
  2.13165808  3.09175301  0.28941301  0.59303302  1.489182   -0.711595
  1.56517601 -1.11776698  1.37791097 -0.67366099 -0.94920802 -1.67869794
  0.67633897  2.02361202 -0.124076   -1.42339802 -0.729366   -0.73556799
 -0.016703   -0.17059     0.452564   -2.02800894  1.23816299 -0.75858003]
Vector length: 300
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

The word2vec file is quite large, but I can send it if needed.
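A much smaller stand-in file may be enough to reproduce the problem. The sketch below writes a few random vectors in the word2vec text format (a "count dimension" header line, then one "word v1 ... vN" line per word). The helper names, the word list, and the file name are made up for illustration, and it is an assumption that Vocab.load_vectors accepts such a file once the header line has been skipped, as the script above does.

```python
import random
import tempfile
from pathlib import Path


def write_word2vec_text(path, words, dim, seed=0):
    """Write random vectors in the word2vec text format (hypothetical helper)."""
    rng = random.Random(seed)
    with open(str(path), "w", encoding="utf-8") as f:
        # Header line: number of words, then vector dimension
        f.write("{} {}\n".format(len(words), dim))
        for word in words:
            values = " ".join("{:.6f}".format(rng.uniform(-1, 1)) for _ in range(dim))
            f.write("{} {}\n".format(word, values))


def read_header(path):
    """Return (word count, dimension) parsed from the header line."""
    with open(str(path), encoding="utf-8") as f:
        count, dim = f.readline().split()
    return int(count), int(dim)


# Illustrative file name and words; any small vocabulary would do
sample_path = Path(tempfile.mkdtemp()) / "model.tiny_format"
write_word2vec_text(sample_path, ["test", "bonjour", "monde"], dim=5)
print(read_header(sample_path))
```

Pointing WORD2VEC_PATH at a file like this should keep the reproduction self-contained, provided the bug does not depend on the size of the original model.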

Your Environment

Operating System: Linux Ubuntu 16.04
Python Version Used: 3.5
spaCy Version Used: HEAD


honnibal · 2017-02-22T10:32:01Z

Hey,

Looking at the code here, I think this function is assuming the vocabulary is loaded already, and we're just adding the vectors. This means it's failing to create vocabulary entries. Since you're starting with an empty vocab, you're not adding any vectors :(. I think this has happened because I've made several edits to this method to try to improve the load time.

Long story short: if you first load the vocab, it should work as expected. If all you have is the list of strings, you could do:

for string in strings:
    _ = vocab[string]

This creates an entry in the vocab for each string.

I think the function should probably be changed so that words which have a vector listed get added to the vocabulary. It seems very unlikely that the current behaviour is what a user would want.
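To make the difference concrete, here is a toy model of the two behaviours. It is not spaCy's code: ToyVocab, load_vectors and the create_missing flag are invented for this sketch, and the only thing borrowed from the discussion above is the idea that the loader skips words that have no vocabulary entry yet.

```python
class ToyVocab:
    """Toy stand-in for a vocabulary; NOT spaCy's Vocab."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}

    def __getitem__(self, word):
        # Like `_ = vocab[string]`: a lookup creates the entry, with a zero vector
        return self.vectors.setdefault(word, [0.0] * self.dim)

    def __contains__(self, word):
        return word in self.vectors


def load_vectors(vocab, table, create_missing):
    """Copy vectors from `table` into `vocab`; return how many were stored.

    With create_missing=False, words without an existing entry are skipped
    (the behaviour described above). With create_missing=True, an entry is
    created for every word that has a vector (the proposed behaviour).
    """
    loaded = 0
    for word, vector in table.items():
        if word not in vocab and not create_missing:
            continue
        vocab.vectors[word] = list(vector)
        loaded += 1
    return loaded


table = {"test": [0.1, 0.2, 0.3], "chat": [0.4, 0.5, 0.6]}

# 1. Empty vocab, skipping loader: nothing is stored, lookups give zeros
empty = ToyVocab(dim=3)
n_empty = load_vectors(empty, table, create_missing=False)

# 2. Workaround: create the entries first, then load
primed = ToyVocab(dim=3)
for string in table:
    _ = primed[string]
n_primed = load_vectors(primed, table, create_missing=False)

# 3. Proposed change: the loader creates the missing entries itself
proposed = ToyVocab(dim=3)
n_proposed = load_vectors(proposed, table, create_missing=True)

print(n_empty, empty["test"])
print(n_primed, primed["test"])
print(n_proposed, proposed["test"])
```

In this toy, case 1 mirrors the all-zero output in the report, while cases 2 and 3 both end up with the stored vectors. Case 3 simply moves the entry creation into the loader so the caller does not need the list of strings up front.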

raphael0202 · 2017-02-22T17:01:38Z

@honnibal Hi, OK, I see. So what are the next steps for integrating word embeddings for a new language into spaCy?

honnibal · 2017-03-08T14:06:02Z

Hm, did I answer this on Gitter already, or are you waiting on an answer still? Sorry, losing track a little bit!

raphael0202 · 2017-03-10T07:58:26Z

I posted this message before asking on SpaCy, so it's good for me ;). By the way, I've been a bit short on time lately to work on spaCy, but integrating the French word2vec model is still on my todo list :)

ines · 2017-03-18T14:44:49Z

Fixed on master, so closing this issue!

raphael0202 · 2017-04-01T16:40:58Z

The issue is still present on master. It works if we load both the StringStore (as indicated by @honnibal) and the lexemes (with the Vocab.load_lexemes method).

ines · 2017-05-07T22:33:10Z

Closing this and making #1046 the master issue. Work in progress for spaCy v2.0!

lock · 2018-05-08T21:39:09Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the "bug" label (Bugs and behaviour differing from documentation) Feb 22, 2017

ines closed this as completed Mar 18, 2017

ines reopened this Apr 7, 2017

honnibal mentioned this issue May 7, 2017

💫 Improve model saving and loading #1046

Closed

ines closed this as completed May 7, 2017

mraduldubey mentioned this issue Jul 5, 2017

💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105

Closed

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
