Vocab.load_vectors_from_bin_loc does not import vectors #856
Comments
Hey,
Looking at the code here, I think this function is assuming the vocabulary is already loaded, and we're just adding the vectors. This means it's failing to create vocabulary entries. Since you're starting with an empty vocab, you're not adding any vectors :(. I think this has happened because I've made several edits to this method to try to improve the load time.
Long story short: if you first load the vocab, it should work as expected. If all you have is the list of strings, you could do:
    for string in strings:
        _ = vocab[string]
to create an entry in the vocab for that string.
I think the function should probably be changed so that words which have a vector listed get added to the vocabulary. It seems very unlikely that the current behaviour is what a user would want.
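To make the behaviour described above concrete, here is a toy sketch in plain Python. It is not spaCy's code: `ToyVocab`, its `load_vectors` method, and the `add_missing` flag are all hypothetical names invented for illustration. It only models the idea that a loader which skips unknown words loads nothing into an empty vocab, and that creating the entries first (or adding missing words during loading, as suggested) avoids this.

```python
# Toy model only: ToyVocab is hypothetical and is NOT spaCy's implementation.
class ToyVocab:
    def __init__(self, width):
        self.width = width
        self.vectors = {}  # word -> vector (list of floats)

    def __getitem__(self, word):
        # Looking up a word creates an entry with a zero vector if it is
        # missing, which is the effect the `_ = vocab[string]` loop relies on.
        return self.vectors.setdefault(word, [0.0] * self.width)

    def __contains__(self, word):
        return word in self.vectors

    def load_vectors(self, table, add_missing=False):
        """Copy vectors from `table`; skip unknown words unless add_missing."""
        loaded = 0
        for word, vec in table.items():
            if word not in self and not add_missing:
                continue  # models the behaviour reported in this issue
            self.vectors[word] = list(vec)
            loaded += 1
        return loaded


table = {"chat": [0.1, 0.2], "chien": [0.3, 0.4]}  # made-up example vectors

# 1. Empty vocab: every word is unknown, so nothing is loaded.
empty = ToyVocab(width=2)
n_empty = empty.load_vectors(table)

# 2. Workaround: create an entry for each string first, then load.
primed = ToyVocab(width=2)
for string in table:
    _ = primed[string]
n_primed = primed.load_vectors(table)

# 3. Suggested change: words that have a vector get added while loading.
fixed = ToyVocab(width=2)
n_fixed = fixed.load_vectors(table, add_missing=True)
```

In this sketch the second and third variants end up with the same vectors; the difference is only whether the caller or the loader is responsible for creating the entries.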
@honnibal Hi, OK, I see. So what are the next steps to integrate word embeddings for a new language in spaCy?
Hm, did I answer this on Gitter already, or are you still waiting on an answer? Sorry, losing track a little bit!
I posted this message before asking on SpaCy, so it's good for me ;). By the way, I'm a bit short on time lately to work on spaCy, but integrating the French word2vec model is still on my todo list :)
Fixed on master, so closing this issue!
The issue is still present on master. It works if we load both the StringStore (as indicated by @honnibal) and the lexemes (with the Vocab.load_lexemes method).
Closing this and making #1046 the master issue. Work in progress for spaCy v2.0!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I've trained a word2vec model for the French language using gensim, and I'm trying to integrate it into spaCy. I've successfully loaded the text vector file with spaCy using Vocab.load_vectors. However, after dumping (with Vocab.dump_vectors) and loading (with Vocab.load_vectors_from_bin_loc) the vectors, the token vectors are all np.zeros.
This script reproduces the bug:
Output:
The word2vec file is quite large, but I can send it if needed
Your Environment