Doc2Vec dropping sentences after training? #325

craigpfeifer · 2015-04-17T16:23:44Z

I'm finding that not all training sentences and their labels occur in the final trained model.

I create one sentence per line in my file, where each sentence has a unique key of "EDGE_"+line_number:

# Inside the loop over the lines of the file:
label = "EDGE_" + str(line_num)
all_labels.append(label)
a_sent = gensim.models.doc2vec.LabeledSentence(toks, [label])
sent_list.append(a_sent)

model = gensim.models.Doc2Vec()
model.build_vocab(sent_list)
model.train(sent_list)

# Check which labels survived into the trained model
for a_label in all_labels:
    if a_label in model:
        print("Label present")

I build the vocab and train the model. Then I want to iterate over the trained vectors for all the labels, but the check above shows that not all of the labels are present in the model.

In my current dataset of 94,460 lines (sentences), only 70,818 of the labels are present in the model. The dataset contains duplicate lines (34,789 lines are dupes, leaving 59,671 unique lines), but every label is unique.

cscorley · 2015-04-20T20:41:14Z

I think this might have to do with min_count. It filters out words (in the word2vec case) and documents (in the doc2vec case) whose count falls below the threshold, so setting it to something lower than the default of 5 should help.
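
To make the suspected mechanism concrete, here is a simplified sketch, not gensim's actual code. It assumes labels and words share one count table and that each label is counted once per sentence it tags. The way gensim counted labels in that version may differ, and `build_shared_vocab` is a hypothetical helper name.

```python
from collections import Counter


def build_shared_vocab(sentences, min_count=5):
    """Count words and labels in one table, then prune by min_count.

    sentences: iterable of (tokens, labels) pairs.
    Returns the set of keys (words and labels) that survive pruning.
    """
    counts = Counter()
    for tokens, labels in sentences:
        counts.update(tokens)
        # Assumption for illustration: a label is counted once per sentence.
        counts.update(labels)
    return {key for key, n in counts.items() if n >= min_count}


sentences = [
    (["the", "cat", "sat"], ["EDGE_0"]),
    (["the", "dog", "sat"], ["EDGE_1"]),
]

# With a threshold of 5, nothing in this tiny corpus survives, labels included.
pruned = build_shared_vocab(sentences, min_count=5)

# With min_count=1, every word and every label is kept.
kept = build_shared_vocab(sentences, min_count=1)
```

Under these assumptions, a unique label can only survive if the threshold is low enough, which is why lowering min_count is the suggested workaround.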

dineshbvadhia · 2015-04-22T08:28:58Z

If "EDGE_"+line_number is the unique identifier (and doesn't appear in the actual content), then min_count=1 has to be set, otherwise the labels won't be present in the model. Is that correct?

piskvorky · 2015-04-22T11:33:01Z

@gojomo's doc2vec improvements will solve this annoyance. Gordon is working on a cleaner API for doc2vec.

gojomo · 2015-06-10T23:59:15Z

The big docvec changes (which, among other things, make doc tags/indexes independent of the vocab-focused min_count) are ready for review in PR #356.
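
As a rough illustration of the idea described above, and not the code in PR #356, the sketch below keeps document tags in their own index so that min_count only prunes words. `build_vocab_and_tags` is a hypothetical helper name.

```python
from collections import Counter


def build_vocab_and_tags(documents, min_count=5):
    """Prune words by min_count, but index every document tag unconditionally.

    documents: iterable of (tokens, tags) pairs.
    Returns (surviving_words, tag_index), where tag_index maps tag -> row number.
    """
    word_counts = Counter()
    tag_index = {}
    for tokens, tags in documents:
        word_counts.update(tokens)
        for tag in tags:
            # Tags never pass through the min_count filter.
            tag_index.setdefault(tag, len(tag_index))
    vocab = {word for word, n in word_counts.items() if n >= min_count}
    return vocab, tag_index


documents = [
    (["the", "cat"], ["EDGE_0"]),
    (["the"], ["EDGE_1"]),
]
vocab, tag_index = build_vocab_and_tags(documents, min_count=2)
```

Here only "the" meets the word threshold, yet both tags are retained regardless of how short their documents are.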

gojomo · 2015-06-28T22:58:37Z

Closing as fixed by merge of #356 into develop.

gojomo mentioned this issue on Jun 11, 2015, in "OverflowError: Python int too large to convert to C long" #321 (closed)

gojomo closed this as completed Jun 28, 2015
