Doc2Vec dropping sentences after training? #325
Comments
I think this might have to do with
If "EDGE_"+line_number is the unique identifier (and doesn't appear in the actual content), then min_count=1 has to be set; otherwise the labels won't be present in the model. Is that correct?
@gojomo's doc2vec improvements will solve this annoyance. Gordon is working on a cleaner API for doc2vec.
The big docvec changes (which, among other things, make doc tags/indexes independent of the vocab-focused min_count) are ready for review in PR #356.
Closing as fixed by merge of #356 into develop.
I'm finding that not all training sentences and their labels occur in the final trained model.
I create one sentence per line in my file, where each sentence has a unique key of "EDGE_"+line_number:
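import gensim

# For each line of the file: build a unique label and a labeled sentence.
label = "EDGE_" + str(line_num)
all_labels.append(label)
a_sent = gensim.models.doc2vec.LabeledSentence(toks, [label])

# Build the vocabulary and train on the collected sentences.
model = gensim.models.Doc2Vec()
model.build_vocab(sent_list)
model.train(sent_list)

# Check which labels made it into the trained model (uses model.__contains__).
for a_label in all_labels:
    if a_label in model:
        print("Label present")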
I build the vocab and train the model. Then I want to iterate over the trained vectors for all of the labels; however, based on the code above, I find that not all of the labels are present in the model.
In my current dataset of 94,460 lines (sentences), only 70,818 of the lines are present in the model. My dataset contains duplicate lines (34,789 lines are dupes), leaving 59,671 unique lines; however, each label is unique.
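To make the min_count explanation from the comments concrete, here is a minimal pure-Python sketch of how labels can be pruned when they share one vocabulary with the words. It does not call gensim. The helper `surviving_labels`, the sample data, and the counting rule (a label's count equals the number of tokens in the sentences it tags) are assumptions made for illustration and are not confirmed in this thread. The value 5 is gensim's default `min_count`.

```python
from collections import Counter


def surviving_labels(labeled_sentences, min_count):
    """Toy vocab builder in which words and labels share one count table.

    ASSUMPTION: a label's count is the number of tokens in the sentences
    it tags. This is an illustration only, not gensim's actual code.
    """
    counts = Counter()
    for tokens, labels in labeled_sentences:
        counts.update(tokens)
        for label in labels:
            counts[label] += len(tokens)
    # Prune every vocabulary entry (word or label) below the threshold.
    kept = {key for key, n in counts.items() if n >= min_count}
    return [
        label
        for _, labels in labeled_sentences
        for label in labels
        if label in kept
    ]


# Hypothetical data: one unique label per line, as in the report above.
sentences = [
    ("the quick brown fox jumps over".split(), ["EDGE_0"]),  # 6 tokens
    ("short line".split(), ["EDGE_1"]),                      # 2 tokens
    ("a b c d e".split(), ["EDGE_2"]),                       # 5 tokens
]

default_kept = surviving_labels(sentences, min_count=5)
all_kept = surviving_labels(sentences, min_count=1)
print(default_kept)
print(all_kept)
```

Under these assumptions, the label of the two-token line is pruned at `min_count=5`, while all three labels survive at `min_count=1`. Note that `min_count=1` also keeps every rare word in the vocabulary.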