Reading GoogleNews word2vec model fails #198

Open
victor-mlai opened this issue Jun 5, 2023 · 1 comment

Comments

@victor-mlai

victor-mlai commented Jun 5, 2023

Reading the GoogleNews-vectors-negative300.bin word2vec model triggers this assertion:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `3000000`,
  right: `2999997`: words contained duplicate entries.'

(when constructing a new simple vocabulary, the word list has 3,000,000 entries, but the index map built from it has only 2,999,997, so three words must be duplicates of other entries)

After some investigation, I removed this word trimming, and the model loaded fine afterwards:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98

I assume the model contains tokens that get trimmed into the same words.
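
To illustrate that assumption, here is a minimal, self-contained sketch of how trimming can turn distinct tokens into duplicates. It does not use finalfusion; `trim_tokens` and `build_indices` are hypothetical stand-ins for the trimming step and the vocabulary's index construction, and the tokens are made up (the actual colliding tokens in the GoogleNews model were not identified here). It relies on Rust's `str::trim` stripping all Unicode whitespace, including the no-break space U+00A0.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the trimming step: strips leading and
/// trailing Unicode whitespace from every token.
fn trim_tokens(raw: &[&str]) -> Vec<String> {
    raw.iter().map(|t| t.trim().to_string()).collect()
}

/// Hypothetical stand-in for index construction: maps each word to its
/// position. Duplicate words collapse into a single map entry.
fn build_indices(words: &[String]) -> HashMap<String, usize> {
    words
        .iter()
        .cloned()
        .enumerate()
        .map(|(idx, word)| (word, idx))
        .collect()
}

fn main() {
    // Made-up tokens: the second one ends in a no-break space (U+00A0),
    // so it is distinct from "cat" until it is trimmed.
    let raw = ["cat", "cat\u{a0}", "dog"];

    // Without trimming, every token is unique.
    let untrimmed: Vec<String> = raw.iter().map(|t| t.to_string()).collect();
    println!(
        "untrimmed: {} words, {} indices",
        untrimmed.len(),
        build_indices(&untrimmed).len()
    );

    // With trimming, two tokens become the same word, so the index map
    // ends up smaller than the word list.
    let trimmed = trim_tokens(&raw);
    println!(
        "trimmed:   {} words, {} indices",
        trimmed.len(),
        build_indices(&trimmed).len()
    );
}
```

In this sketch the trimmed word list keeps three entries while the index map holds only two, which is the same kind of length mismatch the assertion reports.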

Should I create a pull request to remove this line? Or is there something I'm doing wrong?

The model I used is from: https://code.google.com/archive/p/word2vec
Code:

let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();
@danieldk
Member

Thank you for reporting this and sorry for the late reply. I think we added the trimming as a precaution and it should probably be removed.
