Trying to read the GoogleNews-vectors-negative300.bin word2vec model triggers this assert: https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28
thread 'main' panicked at 'assertion failed: `(left == right)` left: `3000000`, right: `2999997`: words contained duplicate entries.'
(when constructing a new simple vocabulary, the number of indices (3,000,000) ends up different than the number of words (2,999,997))
After some investigation, I removed this word trimming, and reading the model worked fine afterwards: https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98
I assume the model contains tokens that get trimmed into the same words.
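To illustrate that assumption, here is a minimal sketch of how trimming can collapse distinct tokens into one word. It assumes the trimming behaves like Rust's `str::trim` (which strips Unicode whitespace). The helper `trim_collisions` and the sample tokens are hypothetical; the actual colliding entries in the GoogleNews model are not identified here.

```rust
use std::collections::HashMap;

/// Groups raw tokens by their trimmed form and returns only the groups
/// in which more than one raw token maps to the same trimmed word.
/// (Hypothetical helper for illustration; not part of finalfusion.)
fn trim_collisions(tokens: &[&str]) -> Vec<(String, Vec<String>)> {
    let mut groups: HashMap<&str, Vec<&str>> = HashMap::new();
    for &token in tokens {
        groups.entry(token.trim()).or_default().push(token);
    }

    let mut collisions: Vec<(String, Vec<String>)> = groups
        .into_iter()
        .filter(|(_, raw)| raw.len() > 1)
        .map(|(trimmed, raw)| {
            (
                trimmed.to_string(),
                raw.into_iter().map(String::from).collect(),
            )
        })
        .collect();

    // HashMap iteration order is unspecified, so sort for stable output.
    collisions.sort();
    collisions
}

fn main() {
    // Made-up tokens: two pairs differ only by surrounding whitespace.
    let tokens = ["cat", "cat ", "dog", "\u{3000}dog", "bird"];
    for (trimmed, raw) in trim_collisions(&tokens) {
        println!("{:?} <- {:?}", trimmed, raw);
    }
}
```

If a vocabulary of N raw tokens contains k such extra duplicates after trimming, a set of the trimmed words has N - k entries, which would match the 3,000,000 vs. 2,999,997 mismatch reported by the assertion.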
Should I create a pull request to remove this line? Or is there something I'm doing wrong?
The model I used is from: https://code.google.com/archive/p/word2vec
Code:
let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();
Thank you for reporting this and sorry for the late reply. I think we added the trimming as a precaution and it should probably be removed.