Fix implementation of smartirs Document Frequency n #2020
Please take a look at the model again: `return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}`
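For context, that return line lives in the `precompute_idfs` helper mentioned later in this thread; here is a minimal sketch of the surrounding function (the signature and docstring are my reconstruction, not verbatim gensim code):

```python
from six import iteritems


def precompute_idfs(wglobal, dfs, total_docs):
    """Map each term id to its global (df-based) weight.

    `wglobal` is the df-weighting function selected by the smartirs string,
    `dfs` maps termid -> document frequency, `total_docs` is the corpus size.
    """
    return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}
```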
@menshikh-iv please close this issue.
@markroxor According to https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System, df="n" should return 1. Returning the raw docfreq causes serious problems with a large corpus (*nc crashes due to a numerical overflow), which is why I raised this issue and the PR.
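For reference, a small sketch of the document-frequency (second letter) variants as they appear in the SMART notation table linked above; this is my own transcription, not gensim code:

```python
import math

# SMART document-frequency weighting variants (N = number of documents,
# df = document frequency of the term), transcribed from the notation table.
SMART_DF = {
    "n": lambda df, N: 1.0,                                           # "none": leave tf untouched
    "t": lambda df, N: math.log(N / float(df), 2),                    # standard idf
    "p": lambda df, N: max(0.0, math.log((N - df) / float(df), 2)),   # probabilistic idf
}
```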
I have just used the terminology, not the exact approach. Please go through the code bit by bit and you will understand.
@markroxor I've just spent a week working out why my code kept crashing, and it turned out to be because the implementation of SMART doesn't match the published specification. I think you wanted "n" to mean "Do not modify the argument" in all positions, but that doesn't make sense for document frequencies, especially as you generally want smaller weights for larger DFs. It would make more sense to think of it as "Do not modify TF".
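To make the scale of the problem concrete, a toy illustration with made-up numbers (none of these figures come from the issue):

```python
# Made-up numbers: a very common term vs. a rare one in a ~57k-document corpus.
tf, df_common, df_rare = 3, 40000, 5

print(tf * df_common)  # 120000 -- treating "n" as raw df massively inflates common terms
print(tf * df_rare)    # 15
print(tf * 1)          # 3      -- treating "n" as "no df weighting" leaves tf alone
```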
You are right, but please look at the code flow and see where exactly the function is called. Judging by your modification, the problem could well be with your code and not with the tf-idf implementation. Feel free to share your code on the gensim Gitter channel.
@markroxor Yes, precompute_idfs should return all ones if the df argument is n. To reproduce the bug I encountered, use the following corpus class:

```python
import nltk.corpus
import gensim


class BrownCorpus(gensim.corpora.TextCorpus):
    def __init__(self, input):
        # The NLTK Brown corpus is the real text source; `input` is only
        # passed through so that TextCorpus builds its dictionary.
        self.corpus = nltk.corpus.brown
        super(BrownCorpus, self).__init__(input)

    def get_texts(self):
        # Yield one token list per Brown paragraph.
        for para in self.corpus.paras():
            result = []
            for sentence in para:
                result.extend(sentence)
            yield result
```

Train a TfidfModel on that with smartirs="nnc", then try to transform a document with it.
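A minimal reproduction sketch along those lines (the exact `input` value and the choice of document are my assumptions, not part of the original report):

```python
# Assumes the BrownCorpus class above and that the NLTK "brown" data is
# already downloaded (nltk.download("brown")).
corpus = BrownCorpus("brown")                      # any non-None input triggers dictionary building
model = gensim.models.TfidfModel(corpus, smartirs="nnc")

doc_bow = next(iter(corpus))                       # first Brown paragraph as a bag-of-words vector
print(model[doc_bow])                              # per the report, "n" returning raw df can overflow here
```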
If that is what is intended, then I agree with your concern. I am going to review your PR and suggest the necessary changes.
Description
`tfidfmodel.updated_wglobal(docfreq, totaldoc, 'n')` returns `utils.identity(docfreq)` instead of 1.
This means that every term frequency will be multiplied by the term's document frequency. For a large corpus this is particularly bad and will cause normalisation to crash.
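For illustration, a sketch of what the fix amounts to, based on the description above (the full branch structure of `updated_wglobal` is my assumption about the surrounding code):

```python
import numpy as np


def updated_wglobal(docfreq, totaldocs, n_df):
    # Sketch: the "n" (no df weighting) variant should return a constant 1,
    # not the raw document frequency that utils.identity(docfreq) hands back.
    if n_df == "n":
        return 1.0
    elif n_df == "t":
        return np.log2(1.0 * totaldocs / docfreq)
    elif n_df == "p":
        return np.log2((1.0 * totaldocs - docfreq) / docfreq)
```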