
Clean up FastText Cython code, fix division by zero #2382

Merged
mpenkov merged 27 commits into piskvorky:develop from the segfault branch on May 4, 2019
Conversation

@mpenkov (Collaborator) commented Feb 14, 2019

The main goal of this PR is to fix #2377

I found the original Cython code hard to debug, so I improved it by moving long code blocks out into separate functions and introducing better variable names. Functionally it is unchanged, but it is easier to read, and the restructuring helped me uncover the division-by-zero problem.
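To illustrate the refactoring style described above, here is a hypothetical before/after sketch in plain Python (the names and logic are invented for illustration; this is not gensim's actual Cython code):

```python
# Before: a long inline block with opaque single-letter names,
# hard to step through in a debugger.
def train_before(m, a, b):
    s = [0.0] * len(m[0])
    for i in range(a, b):
        for j in range(len(m[i])):
            s[j] += m[i][j]
    return [x / (b - a) for x in s]


# After: the same logic, split into a small named helper with
# descriptive variable names, so each piece can be inspected alone.
def average_rows(matrix, start, stop):
    """Return the element-wise mean of matrix[start:stop]."""
    width = len(matrix[0])
    totals = [0.0] * width
    for row in matrix[start:stop]:
        for col, value in enumerate(row):
            totals[col] += value
    count = stop - start
    return [total / count for total in totals]


def train_after(matrix, start, stop):
    return average_rows(matrix, start, stop)
```

Both versions compute the same result; the point is that the extracted helper can be tested and reasoned about in isolation.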

The root cause is bad initialization of the l1 working buffer: it starts out filled with infinities instead of zeros, as the debug output below shows.

(gensim) misha@cabron:~/git/gensim$ python bug.py
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
l1: -inf -inf inf inf -inf -inf inf inf -inf inf inf inf -inf -inf -inf -inf -inf inf inf inf -inf -inf -inf inf inf -inf -inf inf inf -inf inf inf -inf inf inf -inf -inf inf -inf -inf inf -inf inf inf inf inf -inf inf inf inf inf inf inf inf inf inf inf -inf inf inf -inf -inf -inf inf inf inf -inf -inf inf -inf -inf -inf -inf -inf -inf -inf inf inf inf -inf inf -inf -inf -inf -inf inf inf -inf inf inf -inf -inf inf -inf inf inf inf -inf inf inf
syn1[3700]: 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-6 -nan 6
Segmentation fault (core dumped)
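The failure mode in the log above can be sketched in plain NumPy (illustrative only; the names and arithmetic are simplified stand-ins, not the actual Cython code):

```python
import numpy as np

dim = 4

# In the buggy path the l1 working buffer is used without being zeroed
# first; uninitialized memory can decode to +/-inf floats, matching the
# "l1: -inf -inf inf ..." debug output above.  Simulate that state:
l1 = np.array([np.inf, -np.inf, np.inf, -np.inf], dtype=np.float32)

# Accumulating finite vectors into an inf-filled buffer leaves it
# full of infinities...
l1 += np.ones(dim, dtype=np.float32)

# ...and any subsequent 0 * inf (or inf - inf) produces NaN,
# like the "-nan" in the log:
f = np.float32(0.0) * l1[0]

# A related hazard: with 0 buckets a word can have no subwords at all,
# so dividing the accumulated sum by a zero count also yields inf/NaN.
count = 0.0
with np.errstate(divide="ignore", invalid="ignore"):
    bad = np.ones(dim, dtype=np.float32) / np.float32(count)

# The remedy sketched here: zero the buffer before accumulating,
# and guard the count before dividing.
l1 = np.zeros(dim, dtype=np.float32)
if count > 0:
    l1 /= count
```

Once a NaN like this leaks into a computation that is later used as a table offset, the process can crash with the segmentation fault shown above.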
@mpenkov mpenkov requested a review from piskvorky February 14, 2019 13:17
@mpenkov mpenkov changed the title WIP: Avoid segfault when training skipgram model Avoid segfault when training skipgram model Feb 17, 2019
@mpenkov mpenkov changed the title Avoid segfault when training skipgram model Clean up FastText Cython code, fix division by zero Feb 17, 2019
@mpenkov (Collaborator, Author) commented Feb 21, 2019

@piskvorky We have several options for proceeding with this PR.

  1. Include in the next bugfix release (3.7.2)
  2. Wait until next minor release (3.8.0)
  3. Split the bug fix from the Cython improvements, release the former in 3.7.2, and the latter in 3.8.0

WDYT? Any preferences?

@mpenkov mpenkov added the 3.7.2 label Feb 21, 2019
@piskvorky (Owner) commented Feb 21, 2019

I don't understand the trade-offs well enough. Can I leave this decision with you, @mpenkov?

@mpenkov mpenkov removed the 3.7.2 label Mar 7, 2019
@mpenkov
Copy link
Collaborator Author

mpenkov commented Mar 7, 2019

I'm going to go with option 3. It fixes the immediate bug and avoids the risk of introducing new ones.

Opened #2404

We'll merge the Cython improvements after the bugfix release.

@mpenkov mpenkov merged commit b18eeb2 into piskvorky:develop May 4, 2019
@mpenkov mpenkov deleted the segfault branch May 4, 2019 12:03
Successfully merging this pull request may close these issues.

FastText segfaults for some ngram ranges