Doc2Vec dropping sentences after training? #325

Closed
craigpfeifer opened this issue Apr 17, 2015 · 5 comments

Comments

@craigpfeifer

I'm finding that not all training sentences and their labels occur in the final trained model.

I create one sentence per line in my file, where each sentence has a unique key of "EDGE_"+line_number:

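import gensim

all_labels = []
sent_list = []
# tokenized_lines (hypothetical name): one token list per line of the input file
for line_num, toks in enumerate(tokenized_lines):
    label = "EDGE_" + str(line_num)
    all_labels.append(label)
    a_sent = gensim.models.doc2vec.LabeledSentence(toks, [label])
    sent_list.append(a_sent)

model = gensim.models.Doc2Vec()
model.build_vocab(sent_list)
model.train(sent_list)

# Membership test: is the label present in the trained model?
for a_label in all_labels:
    if a_label in model:
        print("Label present")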

I build the vocab and train the model. Then I want to iterate over the trained vectors for all the labels, but the check above shows that not all of the labels are present in the model.

In my current dataset of 94,460 lines (sentences), only 70,818 of the lines are present in the model. My dataset contains duplicate lines (34,789 lines are dupes), leaving 59,671 unique lines; however, each label is unique.

@cscorley
Contributor

I think this might have to do with min_count. It filters out words (in the word2vec case) and documents (in the doc2vec case) whose counts fall below that threshold, so setting it to something lower than the default of 5 should help.

@dineshbvadhia

If "EDGE_"+line_number is the unique identifier (and doesn't appear in the actual content), then min_count=1 has to be set, otherwise the labels won't be present in the model. Is that correct?
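
To make the suspected mechanism concrete, here is a minimal pure-Python sketch of a min_count-style filter. It is a toy model, not gensim's code: it assumes labels are counted in the same vocabulary as words, once per document they are attached to, and the function name `surviving_keys` and the sample documents are made up for illustration.

```python
from collections import Counter


def surviving_keys(docs, min_count):
    """Return the words and labels that pass a min_count threshold.

    Assumption (not taken from gensim's source): labels share one
    vocabulary with words and are counted once per document.
    """
    counts = Counter()
    for tokens, labels in docs:
        counts.update(tokens)   # count every word occurrence
        counts.update(labels)   # count each label once for its document
    return {key for key, n in counts.items() if n >= min_count}


# Hypothetical toy corpus: three documents, each with a unique label.
docs = [
    (["the", "cat", "sat"], ["EDGE_0"]),
    (["the", "dog", "sat"], ["EDGE_1"]),
    (["the", "cat", "ran"], ["EDGE_2"]),
]
labels = ["EDGE_0", "EDGE_1", "EDGE_2"]

# With a threshold of 2, frequent words survive but every unique label is dropped.
kept = surviving_keys(docs, min_count=2)
missing = [label for label in labels if label not in kept]

# With a threshold of 1, nothing is filtered, so every label survives.
kept_all = surviving_keys(docs, min_count=1)
present = [label for label in labels if label in kept_all]
```

Under these assumptions, a label that is unique to one document can never reach a threshold above 1, which matches the suggestion to lower min_count.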

@piskvorky
Owner

@gojomo's doc2vec improvements will solve this annoyance. Gordon is working on a cleaner API for doc2vec.

@gojomo
Collaborator

gojomo commented Jun 10, 2015

The big docvec changes (which, among other things, make doc tags/indexes independent of the vocab-focused min_count) are ready for review in PR #356.

@gojomo
Collaborator

gojomo commented Jun 28, 2015

Closing as fixed by merge of #356 into develop.

@gojomo closed this as completed Jun 28, 2015