Doc2Vec.infer_vector learning rate decays extremely fast (non-linearly) #2061

Closed · umangv opened this issue May 25, 2018 · 2 comments · Label: bug

umangv commented May 25, 2018

I am working with a corpus of very short documents and noticed that inferred vectors for the same document varied considerably from one infer_vector call to the next.

from scipy.spatial.distance import pdist, squareform

# infer_vector expects a list of tokens, not a raw string
testdoc = "This is a small sample document.".lower().split()
vectors = [d2vmod.infer_vector(testdoc) for _ in range(5)]
squareform(pdist(vectors, "cosine"))
array([[0.        , 0.05987812, 0.06183155, 0.06931093, 0.05466599],
       [0.05987812, 0.        , 0.03724874, 0.05006329, 0.04789369],
       [0.06183155, 0.03724874, 0.        , 0.04771786, 0.05983109],
       [0.06931093, 0.05006329, 0.04771786, 0.        , 0.0367826 ],
       [0.05466599, 0.04789369, 0.05983109, 0.0367826 , 0.        ]])

More training steps make things worse in this case:

vectors = [d2vmod.infer_vector(testdoc, steps=10000) for _ in range(5)]
squareform(pdist(vectors, "cosine"))
array([[0.        , 0.27392197, 0.308742  , 0.51374501, 0.45744246],
       [0.27392197, 0.        , 0.14912033, 0.32902151, 0.1822687 ],
       [0.308742  , 0.14912033, 0.        , 0.2895444 , 0.27019636],
       [0.51374501, 0.32902151, 0.2895444 , 0.        , 0.38096254],
       [0.45744246, 0.1822687 , 0.27019636, 0.38096254, 0.        ]])

Note: This is more extreme than what I'm seeing with more domain-specific sample documents, where the vectors only start to become consistent after about 5000 steps.

I believe this is happening because the learning rate decays extremely rapidly:
https://github.com/RaRe-Technologies/gensim/blob/8b810918d59781116794a6679999afdc76b857ef/gensim/models/doc2vec.py#L565

# standalone reproduction of the per-step alpha update inside infer_vector
alpha = 0.025
min_alpha = 0.001
steps = 100
for i in range(steps):
    print(alpha)
    alpha = ((alpha - min_alpha) / (steps - i)) + min_alpha
0.025
0.00124
0.0010024242424242424
0.0010000247371675943
...

Notice that alpha is already very close to min_alpha after the first step, and this is exaggerated even further when the number of steps is larger.
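To see why: on the very first iteration (i = 0) the update sets alpha to min_alpha + (alpha - min_alpha) / steps. With the values above that is 0.001 + 0.024 / 100 = 0.00124, and with steps = 10000 the first update would already land within 0.0000024 of min_alpha.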

When I change Doc2Vec to use a linear decay in the learning rate

alpha_delta = (alpha - min_alpha) / (steps - 1)
for i in range(steps):
    # ... (one training pass over the document)
    alpha -= alpha_delta
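For comparison, here is the same standalone loop as above with the linear update in place of the buggy one (just the toy schedule, not a gensim patch):

alpha = 0.025
min_alpha = 0.001
steps = 100
alpha_delta = (alpha - min_alpha) / (steps - 1)
for i in range(steps):
    print(alpha)  # 0.025, 0.02476, 0.02452, ... decreasing evenly to min_alpha
    alpha -= alpha_delta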

I get much better results. With 20 steps, the pairwise cosine distances are:

array([[0.        , 0.01617053, 0.02467067, 0.01828433, 0.01834735],
       [0.01617053, 0.        , 0.01879757, 0.00910884, 0.01358116],
       [0.02467067, 0.01879757, 0.        , 0.01521225, 0.01392789],
       [0.01828433, 0.00910884, 0.01521225, 0.        , 0.01121792],
       [0.01834735, 0.01358116, 0.01392789, 0.01121792, 0.        ]])

With 100 steps:

array([[0.        , 0.00282428, 0.00373375, 0.00331408, 0.00362875],
       [0.00282428, 0.        , 0.0036147 , 0.0028999 , 0.00210812],
       [0.00373375, 0.0036147 , 0.        , 0.0032986 , 0.00361321],
       [0.00331408, 0.0028999 , 0.0032986 , 0.        , 0.00318849],
       [0.00362875, 0.00210812, 0.00361321, 0.00318849, 0.        ]])

And with 1000 steps:

array([[0.        , 0.00055459, 0.000633  , 0.00074271, 0.00036596],
       [0.00055459, 0.        , 0.00067211, 0.00075522, 0.00058975],
       [0.000633  , 0.00067211, 0.        , 0.00109709, 0.00049239],
       [0.00074271, 0.00075522, 0.00109709, 0.        , 0.00072527],
       [0.00036596, 0.00058975, 0.00049239, 0.00072527, 0.        ]])
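If it helps anyone reproduce the comparison, here is a small helper along the lines of the snippets above (the name mean_self_distance and the model/tokens arguments are just placeholders for a trained Doc2Vec model and a tokenized document):

import numpy as np
from scipy.spatial.distance import pdist

def mean_self_distance(model, tokens, n=5, steps=5):
    # infer the same document n times and summarize how far apart the
    # resulting vectors are; values near 0 mean inference is stable
    vectors = np.array([model.infer_vector(tokens, steps=steps) for _ in range(n)])
    return pdist(vectors, "cosine").mean()

With the current decay this number grows as steps increases; with the linear decay it shrinks, matching the matrices above.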
gojomo (Collaborator) commented Jun 11, 2018

Wow, that's a humongous bug going back to my initial implementation of this 3+ years ago!

It should have been linear from the start; I'm surprised inference has worked as well as it has with this error.

Thanks for finding this!

umangv (Contributor, Author) commented Jun 11, 2018

No problem! I stumbled on it by accident and I'm glad I caught it. My guess is that this problem is far more pronounced for shorter documents.

gojomo added the bug label on Jun 13, 2018