Reading GoogleNews word2vec model fails #198

victor-mlai · 2023-06-05T08:29:58Z

Trying to read the GoogleNews-vectors-negative300.bin word2vec model triggers this assert:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `3000000`,
  right: `2999997`: words contained duplicate entries.'

(When constructing a new simple vocabulary, the number of words (3,000,000) ends up different from the number of index entries (2,999,997).)

After some investigation, I removed the word trimming at the line below, and reading worked fine afterwards:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98

I assume the model contains tokens that get trimmed into the same words.
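To illustrate that assumption, here is a minimal sketch using only the standard library. The tokens are hypothetical (I have not identified which entries in the GoogleNews model actually collide), and `trim_all` and `build_indices` are illustrative helpers, not finalfusion functions. It shows how tokens that differ only in surrounding Unicode whitespace become duplicates after `trim()`, so a word-to-index map ends up with fewer entries than the word list:

```rust
use std::collections::HashMap;

/// Trim every raw token, roughly as a reader that calls `trim()` on each word would.
/// (Illustrative helper, not part of finalfusion.)
fn trim_all(tokens: &[&str]) -> Vec<String> {
    tokens.iter().map(|t| t.trim().to_string()).collect()
}

/// Build a word -> index map; duplicate words collapse into a single entry.
/// (Illustrative helper, not part of finalfusion.)
fn build_indices(words: &[String]) -> HashMap<&str, usize> {
    words
        .iter()
        .enumerate()
        .map(|(idx, word)| (word.as_str(), idx))
        .collect()
}

fn main() {
    // Hypothetical tokens: the second one ends in a no-break space (U+00A0),
    // which `str::trim` treats as whitespace.
    let raw = ["hello", "hello\u{a0}", "world"];

    let words = trim_all(&raw);
    let indices = build_indices(&words);

    println!("words: {}, indices: {}", words.len(), indices.len());
    if words.len() != indices.len() {
        println!("trimming produced duplicate entries");
    }
}
```

If this is what happens in the model, three colliding tokens would be enough to explain the gap between 3,000,000 and 2,999,997.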

Should I create a pull request to remove this line? Or is there something I'm doing wrong?

The model I used is from: https://code.google.com/archive/p/word2vec
Code:

use std::{fs::File, io::BufReader};
use finalfusion::prelude::*;
let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();

danieldk · 2023-12-18T13:38:22Z

Thank you for reporting this and sorry for the late reply. I think we added the trimming as a precaution and it should probably be removed.
