
Clean up FastText Cython code, fix division by zero #2382

Merged
mpenkov merged 27 commits into piskvorky:develop from the segfault branch on May 4, 2019
Conversation

@mpenkov (Collaborator) commented Feb 14, 2019

The main goal of this PR is to fix #2377

I found the original Cython code hard to debug, so I improved it by moving long code blocks out into separate functions and introducing better variable names. Functionally it is unchanged, but it is easier to read, and the restructuring helped me uncover the division-by-zero problem.
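To illustrate the refactoring style described above, here is a hypothetical before/after sketch in plain Python (the names and logic are invented for illustration; this is not gensim's actual Cython code):

```python
# Before: a long inline block with opaque single-letter names,
# hard to step through in a debugger.
def train_before(m, a, b):
    s = [0.0] * len(m[0])
    for i in range(a, b):
        for j in range(len(m[i])):
            s[j] += m[i][j]
    return [x / (b - a) for x in s]


# After: the same logic, split into a small named helper with
# descriptive variable names, so each piece can be inspected alone.
def average_rows(matrix, start, stop):
    """Return the element-wise mean of matrix[start:stop]."""
    width = len(matrix[0])
    totals = [0.0] * width
    for row in matrix[start:stop]:
        for col, value in enumerate(row):
            totals[col] += value
    count = stop - start
    return [total / count for total in totals]


def train_after(matrix, start, stop):
    return average_rows(matrix, start, stop)
```

Both versions compute the same result; the point is that the extracted helper can be tested and reasoned about in isolation.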

The root cause is bad initialization of the l1 working buffer: it starts out filled with infinities instead of zeros, as the debug output below shows.

(gensim) misha@cabron:~/git/gensim$ python bug.py
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
l1: -inf -inf inf inf -inf -inf inf inf -inf inf inf inf -inf -inf -inf -inf -inf inf inf inf -inf -inf -inf inf inf -inf -inf inf inf -inf inf inf -inf inf inf -inf -inf inf -inf -inf inf -inf inf inf inf inf -inf inf inf inf inf inf inf inf inf inf inf -inf inf inf -inf -inf -inf inf inf inf -inf -inf inf -inf -inf -inf -inf -inf -inf -inf inf inf inf -inf inf -inf -inf -inf -inf inf inf -inf inf inf -inf -inf inf -inf inf inf inf -inf inf inf
syn1[3700]: 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-6 -nan 6
Segmentation fault (core dumped)
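The failure mode in the log above can be sketched in plain NumPy (illustrative only; the names and arithmetic are simplified stand-ins, not the actual Cython code):

```python
import numpy as np

dim = 4

# In the buggy path the l1 working buffer is used without being zeroed
# first; uninitialized memory can decode to +/-inf floats, matching the
# "l1: -inf -inf inf ..." debug output above.  Simulate that state:
l1 = np.array([np.inf, -np.inf, np.inf, -np.inf], dtype=np.float32)

# Accumulating finite vectors into an inf-filled buffer leaves it
# full of infinities...
l1 += np.ones(dim, dtype=np.float32)

# ...and any subsequent 0 * inf (or inf - inf) produces NaN,
# like the "-nan" in the log:
f = np.float32(0.0) * l1[0]

# A related hazard: with 0 buckets a word can have no subwords at all,
# so dividing the accumulated sum by a zero count also yields inf/NaN.
count = 0.0
with np.errstate(divide="ignore", invalid="ignore"):
    bad = np.ones(dim, dtype=np.float32) / np.float32(count)

# The remedy sketched here: zero the buffer before accumulating,
# and guard the count before dividing.
l1 = np.zeros(dim, dtype=np.float32)
if count > 0:
    l1 /= count
```

Once a NaN like this leaks into a computation that is later used as a table offset, the process can crash with the segmentation fault shown above.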
@mpenkov mpenkov requested a review from piskvorky February 14, 2019 13:17
@mpenkov mpenkov changed the title WIP: Avoid segfault when training skipgram model Avoid segfault when training skipgram model Feb 17, 2019
@mpenkov mpenkov changed the title Avoid segfault when training skipgram model Clean up FastText Cython code, fix division by zero Feb 17, 2019
@mpenkov (Collaborator, Author) commented Feb 21, 2019

@piskvorky We have several options for proceeding with this PR.

  1. Include in the next bugfix release (3.7.2)
  2. Wait until next minor release (3.8.0)
  3. Split the bug fix from the Cython improvements, release the former in 3.7.2, and the latter in 3.8.0

WDYT? Any preferences?

@mpenkov mpenkov added the 3.7.2 label Feb 21, 2019
@piskvorky (Owner) commented Feb 21, 2019

I don't understand the trade-offs well enough. Can I leave this decision with you, @mpenkov?

@mpenkov mpenkov removed the 3.7.2 label Mar 7, 2019
@mpenkov
Copy link
Collaborator Author

mpenkov commented Mar 7, 2019

I'm going to go with option 3. It fixes the immediate bug and avoids the risk of introducing new ones.

Opened #2404

We'll merge the Cython improvements after the bugfix release.

@mpenkov mpenkov merged commit b18eeb2 into piskvorky:develop May 4, 2019
@mpenkov mpenkov deleted the segfault branch May 4, 2019 12:03
Successfully merging this pull request may close these issues.

FastText segfaults for some ngram ranges