
Word2Vec and Doc2Vec do not update word embeddings if negative keyword is set to 0 #1983

Closed
swierh opened this issue Mar 17, 2018 · 9 comments · Fixed by #3443
Labels: bug, difficulty easy, documentation

Comments

swierh commented Mar 17, 2018

Description

Setting the negative keyword to 0 for Doc2Vec causes training not to update the word embeddings after their random initialisation.
This happens silently and is behavior I wasn't expecting.

Steps/Code/Corpus to Reproduce

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
from gensim.models import Doc2Vec, Word2Vec
from gensim.models.doc2vec import TaggedDocument

df = pd.DataFrame(fetch_20newsgroups().data)
df[0] = df[0].str.split(' ')
documents = df.apply(lambda x: TaggedDocument(x[0], x.index), axis=1)

model1a = Doc2Vec(documents, negative=1, epochs=1)
model1b = Doc2Vec(documents, negative=0, epochs=1)
model2a = Doc2Vec(documents, negative=1, epochs=2)
model2b = Doc2Vec(documents, negative=0, epochs=2)

print('model1a:', model1a.wv.most_similar('test'))
print('model1b:', model1b.wv.most_similar('test'))
print('model2a:', model2a.wv.most_similar('test'))
print('model2b:', model2b.wv.most_similar('test'))

model1a = Word2Vec(df[0], negative=1, iter=1)
model1b = Word2Vec(df[0], negative=0, iter=1)
model2a = Word2Vec(df[0], negative=1, iter=2)
model2b = Word2Vec(df[0], negative=0, iter=2)

print('model1a:', model1a.most_similar('test'))
print('model1b:', model1b.most_similar('test'))
print('model2a:', model2a.most_similar('test'))
print('model2b:', model2b.most_similar('test'))

Results

As can be seen below, the models with negative=0 return identical results after 1 or 2 epochs of training, whereas the models with negative=1 return different (and somewhat more sensible) results.
Doc2Vec:

model1a:
[('time', 0.9929366111755371),
 ('either', 0.9923557639122009),
 ('up', 0.9921339154243469),
 ('problem', 0.9915313720703125),
 ('being', 0.9915310144424438),
 ('getting', 0.991266131401062),
 ('group', 0.991013765335083),
 ('keeping', 0.9908334016799927),
 ('players', 0.9906938672065735),
 ('further', 0.9902615547180176)]
 
model1b:
[('518-393-7228', 0.4212157428264618),
 ('anyone?', 0.4167076647281647),
 ('it,', 0.3915032744407654),
 ('deliver', 0.3873376250267029),
 ('Books', 0.3643316328525543),
 ('stuck', 0.35024553537368774),
 ("o'clock", 0.34999915957450867),
 ('(Dostoevsky)', 0.34075409173965454),
 ('(Thyagi', 0.33959853649139404),
 ('MSDOS', 0.3370114862918854)]
 
model2a:
[('chip,', 0.9874706268310547),
 ('moves', 0.9828106164932251),
 ('board', 0.9789682626724243),
 ('adding', 0.978229820728302),
 ('express', 0.9764397144317627),
 ('sport', 0.9763677716255188),
 ('correctly', 0.9756811261177063),
 ('restricted', 0.9725382328033447),
 ('concern', 0.9719469547271729),
 ('user,', 0.9711147546768188)]
 
model2b:
[('518-393-7228', 0.4212157428264618),
 ('anyone?', 0.4167076647281647),
 ('it,', 0.3915032744407654),
 ('deliver', 0.3873376250267029),
 ('Books', 0.3643316328525543),
 ('stuck', 0.35024553537368774),
 ("o'clock", 0.34999915957450867),
 ('(Dostoevsky)', 0.34075409173965454),
 ('(Thyagi', 0.33959853649139404),
 ('MSDOS', 0.3370114862918854)]

Word2Vec:

model1a:
[('moral', 0.9974657297134399),
 ('obvious', 0.9970457553863525),
 ('high', 0.9967347979545593),
 ('ago,', 0.9966593384742737),
 ('food', 0.9964239001274109),
 ('case,', 0.996358335018158),
 ('in,', 0.9963352084159851),
 ('problems', 0.996278703212738),
 ('doctor', 0.9962717890739441),
 ('kept', 0.9961578845977783)]

model1b:
[('Fulk)\nSubject:', 0.43792209029197693),
 ('weak', 0.3926801085472107),
 ('Provine', 0.382274866104126),
 ('suspension,', 0.37375500798225403),
 ('kirsch@staff.tc.umn.edu', 0.3638245761394501),
 ('negligible', 0.3633933365345001),
 ('frozen', 0.36065810918807983),
 ('notch', 0.35705092549324036),
 ('_|_|_', 0.3445291221141815),
 ('(Grant', 0.3377472162246704)]

model2a:
[('motorcycle', 0.9882928729057312),
 ('grounds', 0.9797140955924988),
 ('charge', 0.977377712726593),
 ('goes', 0.9750747084617615),
 ('trip', 0.9731242060661316),
 ('mark', 0.9729797840118408),
 ('needed', 0.9718480706214905),
 ('directly', 0.9717421531677246),
 ('group,', 0.9714720845222473),
 ('store', 0.971139669418335)]

model2b:
[('Fulk)\nSubject:', 0.43792209029197693),
 ('weak', 0.3926801085472107),
 ('Provine', 0.382274866104126),
 ('suspension,', 0.37375500798225403),
 ('kirsch@staff.tc.umn.edu', 0.3638245761394501),
 ('negligible', 0.3633933365345001),
 ('frozen', 0.36065810918807983),
 ('notch', 0.35705092549324036),
 ('_|_|_', 0.3445291221141815),
 ('(Grant', 0.3377472162246704)]

Logs during training

Doc2Vec

model1a:

2018-03-17 12:48:28,331 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-03-17 12:48:28,966 : INFO : PROGRESS: at example #10000, processed 3198312 words (5044963/s), 390776 word types, 1 tags
2018-03-17 12:48:29,049 : INFO : collected 427021 word types and 1 unique tags from a corpus of 11314 examples and 3593473 words
2018-03-17 12:48:29,050 : INFO : Loading a fresh vocabulary
2018-03-17 12:48:29,283 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:48:29,284 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:48:29,373 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:48:29,379 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:48:29,379 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:48:29,490 : INFO : estimated required memory for 40708 words and 100 dimensions: 52920800 bytes
2018-03-17 12:48:29,490 : INFO : resetting layer weights
2018-03-17 12:48:29,840 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=1 window=5
2018-03-17 12:48:30,848 : INFO : EPOCH 1 - PROGRESS: at 56.44% examples, 1165806 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:31,587 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:31,588 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:31,594 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:31,595 : INFO : EPOCH - 1 : training on 3593473 raw words (2068258 effective words) took 1.8s, 1180569 effective words/s
2018-03-17 12:48:31,595 : INFO : training on a 3593473 raw words (2068258 effective words) took 1.8s, 1178829 effective words/s

model1b:

2018-03-17 12:48:31,596 : INFO : collecting all words and their counts
2018-03-17 12:48:31,597 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-03-17 12:48:32,172 : INFO : PROGRESS: at example #10000, processed 3198312 words (5569233/s), 390776 word types, 1 tags
2018-03-17 12:48:32,252 : INFO : collected 427021 word types and 1 unique tags from a corpus of 11314 examples and 3593473 words
2018-03-17 12:48:32,253 : INFO : Loading a fresh vocabulary
2018-03-17 12:48:32,413 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:48:32,414 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:48:32,506 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:48:32,511 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:48:32,512 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:48:32,557 : INFO : estimated required memory for 40708 words and 100 dimensions: 36637600 bytes
2018-03-17 12:48:32,557 : INFO : resetting layer weights
2018-03-17 12:48:32,910 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=0 window=5
2018-03-17 12:48:33,913 : INFO : EPOCH 1 - PROGRESS: at 66.48% examples, 1398264 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:34,396 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:34,403 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:34,409 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:34,412 : INFO : EPOCH - 1 : training on 3593473 raw words (2067706 effective words) took 1.5s, 1378469 effective words/s
2018-03-17 12:48:34,412 : INFO : training on a 3593473 raw words (2067706 effective words) took 1.5s, 1376358 effective words/s

model2a:

2018-03-17 12:48:34,413 : INFO : collecting all words and their counts
2018-03-17 12:48:34,415 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-03-17 12:48:35,073 : INFO : PROGRESS: at example #10000, processed 3198312 words (4869749/s), 390776 word types, 1 tags
2018-03-17 12:48:35,152 : INFO : collected 427021 word types and 1 unique tags from a corpus of 11314 examples and 3593473 words
2018-03-17 12:48:35,153 : INFO : Loading a fresh vocabulary
2018-03-17 12:48:35,452 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:48:35,453 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:48:35,563 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:48:35,568 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:48:35,569 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:48:35,686 : INFO : estimated required memory for 40708 words and 100 dimensions: 52920800 bytes
2018-03-17 12:48:35,687 : INFO : resetting layer weights
2018-03-17 12:48:36,044 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=1 window=5
2018-03-17 12:48:37,052 : INFO : EPOCH 1 - PROGRESS: at 42.69% examples, 886813 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:38,052 : INFO : EPOCH 1 - PROGRESS: at 98.46% examples, 1013776 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:38,090 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:38,091 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:38,102 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:38,102 : INFO : EPOCH - 1 : training on 3593473 raw words (2067777 effective words) took 2.1s, 1006127 effective words/s
2018-03-17 12:48:39,107 : INFO : EPOCH 2 - PROGRESS: at 57.09% examples, 1186646 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:39,865 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:39,873 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:39,874 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:39,875 : INFO : EPOCH - 2 : training on 3593473 raw words (2067767 effective words) took 1.8s, 1168711 effective words/s
2018-03-17 12:48:39,875 : INFO : training on a 7186946 raw words (4135544 effective words) took 3.8s, 1079693 effective words/s

model2b:

2018-03-17 12:48:39,876 : INFO : collecting all words and their counts
2018-03-17 12:48:39,878 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-03-17 12:48:40,616 : INFO : PROGRESS: at example #10000, processed 3198312 words (4332917/s), 390776 word types, 1 tags
2018-03-17 12:48:40,697 : INFO : collected 427021 word types and 1 unique tags from a corpus of 11314 examples and 3593473 words
2018-03-17 12:48:40,698 : INFO : Loading a fresh vocabulary
2018-03-17 12:48:40,982 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:48:40,985 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:48:41,106 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:48:41,111 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:48:41,112 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:48:41,156 : INFO : estimated required memory for 40708 words and 100 dimensions: 36637600 bytes
2018-03-17 12:48:41,157 : INFO : resetting layer weights
2018-03-17 12:48:41,517 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=0 window=5
2018-03-17 12:48:42,525 : INFO : EPOCH 1 - PROGRESS: at 65.36% examples, 1362872 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:43,161 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:43,166 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:43,169 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:43,170 : INFO : EPOCH - 1 : training on 3593473 raw words (2067697 effective words) took 1.7s, 1252184 effective words/s
2018-03-17 12:48:44,173 : INFO : EPOCH 2 - PROGRESS: at 55.19% examples, 1149188 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:48:44,820 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:48:44,825 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:48:44,828 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:48:44,829 : INFO : EPOCH - 2 : training on 3593473 raw words (2067746 effective words) took 1.7s, 1248269 effective words/s
2018-03-17 12:48:44,829 : INFO : training on a 7186946 raw words (4135443 effective words) took 3.3s, 1248589 effective words/s
Word2Vec

model1a:

2018-03-17 12:54:55,113 : INFO : collecting all words and their counts
2018-03-17 12:54:55,114 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-17 12:54:55,710 : INFO : PROGRESS: at sentence #10000, processed 3198312 words, keeping 390776 word types
2018-03-17 12:54:55,789 : INFO : collected 427021 word types from a corpus of 3593473 raw words and 11314 sentences
2018-03-17 12:54:55,789 : INFO : Loading a fresh vocabulary
2018-03-17 12:54:55,943 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:54:55,943 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:54:56,036 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:54:56,041 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:54:56,042 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:54:56,154 : INFO : estimated required memory for 40708 words and 100 dimensions: 52920400 bytes
2018-03-17 12:54:56,154 : INFO : resetting layer weights
2018-03-17 12:54:56,501 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=1 window=5
2018-03-17 12:54:57,505 : INFO : EPOCH 1 - PROGRESS: at 78.98% examples, 1626661 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:54:57,768 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:54:57,769 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:54:57,772 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:54:57,773 : INFO : EPOCH - 1 : training on 3593473 raw words (2056644 effective words) took 1.3s, 1620306 effective words/s
2018-03-17 12:54:57,773 : INFO : training on a 3593473 raw words (2056644 effective words) took 1.3s, 1618042 effective words/s

model1b:

2018-03-17 12:54:57,780 : INFO : collecting all words and their counts
2018-03-17 12:54:57,782 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-17 12:54:58,335 : INFO : PROGRESS: at sentence #10000, processed 3198312 words, keeping 390776 word types
2018-03-17 12:54:58,412 : INFO : collected 427021 word types from a corpus of 3593473 raw words and 11314 sentences
2018-03-17 12:54:58,413 : INFO : Loading a fresh vocabulary
2018-03-17 12:54:58,703 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:54:58,703 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:54:58,791 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:54:58,796 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:54:58,797 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:54:58,841 : INFO : estimated required memory for 40708 words and 100 dimensions: 36637200 bytes
2018-03-17 12:54:58,842 : INFO : resetting layer weights
2018-03-17 12:54:59,194 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=0 window=5
2018-03-17 12:55:00,184 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:55:00,189 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:55:00,189 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:55:00,190 : INFO : EPOCH - 1 : training on 3593473 raw words (2057464 effective words) took 1.0s, 2069005 effective words/s
2018-03-17 12:55:00,190 : INFO : training on a 3593473 raw words (2057464 effective words) took 1.0s, 2064908 effective words/s

model2a:

2018-03-17 12:55:00,198 : INFO : collecting all words and their counts
2018-03-17 12:55:00,200 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-17 12:55:00,760 : INFO : PROGRESS: at sentence #10000, processed 3198312 words, keeping 390776 word types
2018-03-17 12:55:00,835 : INFO : collected 427021 word types from a corpus of 3593473 raw words and 11314 sentences
2018-03-17 12:55:00,836 : INFO : Loading a fresh vocabulary
2018-03-17 12:55:01,001 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:55:01,001 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:55:01,102 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:55:01,108 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:55:01,108 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:55:01,215 : INFO : estimated required memory for 40708 words and 100 dimensions: 52920400 bytes
2018-03-17 12:55:01,215 : INFO : resetting layer weights
2018-03-17 12:55:01,583 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=1 window=5
2018-03-17 12:55:02,588 : INFO : EPOCH 1 - PROGRESS: at 70.71% examples, 1471948 words/s, in_qsize 5, out_qsize 0
2018-03-17 12:55:02,957 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:55:02,958 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:55:02,960 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:55:02,961 : INFO : EPOCH - 1 : training on 3593473 raw words (2056026 effective words) took 1.4s, 1494852 effective words/s
2018-03-17 12:55:03,970 : INFO : EPOCH 2 - PROGRESS: at 78.28% examples, 1614596 words/s, in_qsize 6, out_qsize 0
2018-03-17 12:55:04,240 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:55:04,241 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:55:04,244 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:55:04,245 : INFO : EPOCH - 2 : training on 3593473 raw words (2057234 effective words) took 1.3s, 1611148 effective words/s
2018-03-17 12:55:04,245 : INFO : training on a 7186946 raw words (4113260 effective words) took 2.7s, 1545423 effective words/s

model2b:

2018-03-17 12:55:04,255 : INFO : collecting all words and their counts
2018-03-17 12:55:04,257 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-17 12:55:04,810 : INFO : PROGRESS: at sentence #10000, processed 3198312 words, keeping 390776 word types
2018-03-17 12:55:04,882 : INFO : collected 427021 word types from a corpus of 3593473 raw words and 11314 sentences
2018-03-17 12:55:04,882 : INFO : Loading a fresh vocabulary
2018-03-17 12:55:05,177 : INFO : min_count=5 retains 40708 unique words (9% of original 427021, drops 386313)
2018-03-17 12:55:05,177 : INFO : min_count=5 leaves 3082977 word corpus (85% of original 3593473, drops 510496)
2018-03-17 12:55:05,278 : INFO : deleting the raw counts dictionary of 427021 items
2018-03-17 12:55:05,283 : INFO : sample=0.001 downsamples 32 most-common words
2018-03-17 12:55:05,284 : INFO : downsampling leaves estimated 2056839 word corpus (66.7% of prior 3082977)
2018-03-17 12:55:05,329 : INFO : estimated required memory for 40708 words and 100 dimensions: 36637200 bytes
2018-03-17 12:55:05,329 : INFO : resetting layer weights
2018-03-17 12:55:05,674 : INFO : training model with 3 workers on 40708 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=0 window=5
2018-03-17 12:55:06,562 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:55:06,566 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:55:06,567 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:55:06,567 : INFO : EPOCH - 1 : training on 3593473 raw words (2056953 effective words) took 0.9s, 2308747 effective words/s
2018-03-17 12:55:07,450 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-03-17 12:55:07,451 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-17 12:55:07,458 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-17 12:55:07,459 : INFO : EPOCH - 2 : training on 3593473 raw words (2056387 effective words) took 0.9s, 2313723 effective words/s
2018-03-17 12:55:07,462 : INFO : training on a 7186946 raw words (4113340 effective words) took 1.8s, 2300980 effective words/s

Versions

Python 3.6.3 (default, Oct  3 2017, 21:45:48) 
[GCC 7.2.0]
NumPy 1.14.1
SciPy 1.0.0
gensim 3.4.0
FAST_VERSION 1
menshikh-iv (Contributor) commented:

Thanks for the report @swierh 👍

CC: @gojomo @manneshiva is this expected behavior, or how should this work?

gojomo (Collaborator) commented Mar 20, 2018

I agree it could be surprising, and there should be a warning or exception when this error is made. (The chief hint currently is the near-instantaneous training. You might get a similarly fast-but-useless result by setting window=0 or size=0, or by setting min_count or sample to extreme values that drop all or almost all words.)

But note there's a level at which this behavior makes logical sense: with zero negative examples with which to do negative-sampling, and with hierarchical-softmax not enabled (left at its default hs=0 value), there is no backprop-correction method specified, and thus all 'training' is necessarily a no-op. The user is getting what they (mistakenly) requested: an initialized model with no backprop-learning method configured.
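
A quick way to see the no-op directly, sketched here with a toy corpus of my own rather than the report's data, and assuming an affected gensim version such as the 3.4.0 reported above:

import numpy as np
from gensim.models import Word2Vec

sentences = [['hello', 'world'], ['hello', 'gensim']] * 500  # toy corpus, purely illustrative

model = Word2Vec(min_count=1, hs=0, negative=0)  # no training method configured
model.build_vocab(sentences)
before = model.wv['hello'].copy()  # randomly initialised vector
model.train(sentences, total_examples=model.corpus_count, epochs=5)

print(np.allclose(before, model.wv['hello']))  # True: "training" left the vectors untouched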

menshikh-iv added the bug, documentation, and difficulty easy labels on Mar 21, 2018
aneesh-joshi (Contributor) commented:

@gojomo
should we log a warning when such extreme values are set, so that the problem isn't silent if it isn't what the user wants?

gojomo (Collaborator) commented Jan 24, 2019

A warning, or even a ValueError, when hs is False-ish and negative is 0 would make sense.
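
For illustration, the simplest form of that guard might look like the sketch below; the function name and error wording are hypothetical, not existing gensim code:

def check_training_mode(hs, negative):
    # With hierarchical softmax disabled and no negative samples requested,
    # no output layer gets trained, so word vectors never move from their
    # random initialisation -- better to fail fast than train silently.
    if not hs and negative == 0:
        raise ValueError(
            "Set hs=1 or negative > 0: with hs=0 and negative=0 "
            "no training method is configured."
        )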

gojomo (Collaborator) commented Nov 3, 2019

The warning should also happen if this error is made with FastText: https://groups.google.com/d/msg/gensim/9WPVoeiq8Mk/V_x7_L6bAgAJ

gau-nernst (Contributor) commented Feb 15, 2023

I was looking for full softmax training (by setting hs=0 and negative=0) in gensim's Word2Vec implementation and could not find any. Upon investigation, I discovered that gensim's Word2Vec does not support full softmax training. I believe the original Google C implementation does not have a full softmax implementation either.

Since gensim only supports hierarchical softmax or negative sampling training, I propose that there should be a check that disallows setting negative=0. Or, even better, gensim should make sure negative > 0. The hs parameter is used to specify whether hierarchical softmax or negative sampling is used, so there is no reason to have negative = 0. The doc saying If set to 0, no negative sampling is used. is misleading because (1) whether negative sampling is used is determined by the hs parameter, not by the negative parameter, and (2) at least from what I understand, "no negative sampling is used" would mean "full softmax training".

If the maintainers are okay with it, I can submit a PR to raise a ValueError when negative < 0.

Also, it would perhaps be good to update the documentation to note that full softmax training is not supported.

gojomo (Collaborator) commented Feb 15, 2023

A PR for better user messaging when nonsensical parameters are used would be appreciated! The existing _check_training_sanity() helper method might be a good place for such checking. Note that 'hs' does not control negative-sampling at all, just whether hierarchical-softmax is independently initialized/trained. You must set negative=0 to prevent negative-sampling from being initialized & trained. Essentially, either negative or hs must be nonzero for any training mode at all to be active. (And if both are nonzero, it's probably a mistake – there's likely no good reason for interleaving both methods – but it's not technically illegal, and was supported in the original Google word2vec.c.)

However, common word2vec implementations (outside academic demos) generally don't offer full softmax, as using the shortcut of either negative-sampling or hierarchical-softmax was essential for word2vec to be practical with corpora/vocabularies of usual interest. The Google word2vec.c release on which Gensim was originally based didn't offer full softmax, and Gensim has never offered it, nor mentioned it in docs.

So I think any expectation that "no negative sampling" means softmax would be used instead involves unwarranted assumptions not based on Gensim docs or similar-library precedents. A few words to armor against that assumption could still be beneficial, if kept minimal and in just the right place(s). But a good error message if the user tries negative=0, hs=0 might be enough.

gau-nernst (Contributor) commented:

Thank you for the clarification and explanation; I understand it better now. I will submit a PR doing what you suggest.

I was looking at the Google C source code yesterday, and saw that the logic is:

if (hs)
    (do hierarchical softmax training)
if (negative > 0)
    (do negative sampling training)

So hierarchical softmax training and negative sampling training are independent. Gensim also follows this logic, since gensim is a direct port, I believe:

https://github.com/RaRe-Technologies/gensim/blob/f35faae7a7b0c3c8586fb61208560522e37e0e7e/gensim/models/word2vec_inner.pyx#L587-L590

So apart from hs=0 and negative=0 being problematic, I think setting both of them to non-zero values is also problematic? Although it is not illegal to do both hierarchical softmax and negative sampling training at the same time, most likely it is not intended? (I'm not sure if anyone uses both at the same time; correct me if I'm wrong.) If it is indeed most likely unintended, raising an error when both hs and negative are non-zero should be sufficient, I believe.

Also, the docs say:

hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.

I think it is confusing to say If 0, and negative is non-zero, negative sampling will be used, since when hs=1, negative sampling is still used if negative > 0 (which, again, comes back to hierarchical softmax and negative sampling being independent). This was the source of my confusion.

Correct me if I have understood any part wrongly. All in all, I think the confusion comes from the not-so-intuitive API inherited from the Google C source code. There should be one argument to specify the loss function, so that we can't end up with no loss function or with multiple loss functions.
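
For illustration, a single argument like that could be mapped onto the existing pair of parameters roughly as in the sketch below (purely hypothetical; gensim exposes no such loss parameter):

def resolve_loss(loss='negative_sampling', negative=5):
    """Translate a single loss choice into the (hs, negative) pair the C-port logic expects."""
    if loss == 'hierarchical_softmax':
        return 1, 0  # hs=1, negative sampling off
    if loss == 'negative_sampling':
        if negative <= 0:
            raise ValueError('negative sampling requires negative > 0')
        return 0, negative  # hs=0, negative sampling on
    raise ValueError('unknown loss: %r' % (loss,))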

gojomo (Collaborator) commented Feb 16, 2023

Yes, Gensim's implementation started as a direct port from the Google C code, so inherited its peculiarities.

In practice, anyone who has negative and hs both activated (with non-zero values) is probably making a mistake. (The way in which the two modes do interleaved updates from their two contrasting output layers, while both updating the same shared set of input word-vectors, is unique. If someone is evaluating results in a space- and time-oblivious manner, it may superficially appear to be a high-performing choice: the same epochs count is actually doing twice as much training, with commensurate extra model space & time costs. But I don't think I've ever seen a case where choosing one mode, and giving it some extra time-in-epochs or space-in-dimensions-or-vocab, wasn't even better for a particular dataset/end-goal.)

But it's worked this way for so long that we'd not want to break it except as part of a more general, advance-advised API cleanup. So for now, while an error-that-must-be-fixed is appropriate if both are 0/disabled, if both are active we should just emit some sort of warning that allows the functional but likely suboptimal approach to proceed. But clearer language in the API docs is always a good idea, if it can prevent misinterpretations arising from earlier wordings.
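
In code, that split (a hard error when both are disabled, only a warning when both are enabled) might look roughly like the sketch below; the helper name and messages are placeholders, not what an eventual PR would necessarily use:

import logging

logger = logging.getLogger(__name__)

def check_hs_negative(hs, negative):
    if not hs and negative == 0:
        # Neither output layer is configured: training would be a silent no-op.
        raise ValueError('Set hs=1 or negative > 0; hs=0 with negative=0 trains nothing.')
    if hs and negative > 0:
        # Legal and long-supported, but usually unintended: both output layers
        # are trained in interleaved fashion against the same input vectors.
        logger.warning(
            'Both hierarchical softmax (hs=1) and negative sampling (negative=%d) are '
            'enabled; this doubles training work and is rarely what you want.', negative,
        )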

(Similar confusion was seen in #2550 and #2844, and I thought we'd added some sort of warning for one or both of the confused cases, but maybe that was in some exploratory work never integrated.)
