Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix implementation of smartirs Document Frequency n #2020

Closed
PeteBleackley opened this issue Apr 6, 2018 · 8 comments
Closed

Fix implementation of smartirs Document Frequency n #2020

PeteBleackley opened this issue Apr 6, 2018 · 8 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix

Comments

@PeteBleackley
Copy link
Contributor

Description

tfidfmodel.updated_wglobal(docfreq, totaldoc, 'n') returns utils.identity(docfreq) instead of 1.
This means that all term frequencies will be multiplied by the document frequency. For a large corpus this is particularly bad and will cause normalisation to crash.

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Apr 6, 2018
@markroxor
Copy link
Contributor

Please take a look at the model again wglobal is actually used here

return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}

wglobal returns the normalized doc_freq it is not a normalizing factor therefore when tfidfmodel.updated_wglobal(docfreq, totaldoc, 'n') is used it should return docfreq

@markroxor
Copy link
Contributor

@menshikh-iv please close this issue.

@PeteBleackley
Copy link
Contributor Author

@markroxor According to https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System , df="n" should return 1. Returning docfreq causes serious problems with a large corpus (*nc crashes due to a numerical overflow), and it was because of this that I raised the issue and the PR.

@markroxor
Copy link
Contributor

I have just used the terminologies and not the exact approach. Please go through the code bit by bit and you will understand.

@PeteBleackley
Copy link
Contributor Author

@markroxor I've just spent a week working out why my code kept crashing, and it turned out to be because the implementation of SMART doesn't match the published specification. I think you wanted "n" to mean "Do not modify the argument" in all positions, but that doesn't make sense for document frequencies, especially as you generally want smaller weights for larger DFs. It would make more sense to think of it as "Do not modify TF".

@markroxor
Copy link
Contributor

markroxor commented Apr 8, 2018

I think you wanted "n" to mean "Do not modify the argument" in all positions,

You are right but can you see the code flow and where exactly the function updated_wglobal is used?
It is used here.

According to your modification updated_global will return 1, and therefore the precompute_idfs will return a dictionary with all values as 1. Please go through the code flow.

I've just spent a week working out why my code kept crashing, and it turned out to be because the implementation of SMART doesn't match the published specification.

The problem can possibly be with your code and not with tf-idf implementation. Feel free to share your code at the gensim gitter channel.

@PeteBleackley
Copy link
Contributor Author

PeteBleackley commented Apr 8, 2018

@markroxor Yes, precompute_idfs should return all ones if the df argument is n.

To reproduce the bug I encountered, use the following corpus class.

import nltk.corpus
import gensim

class BrownCorpus(gensim.corpora.TextCorpus):
    
    def __init__(self,input):
        self.corpus=nltk.corpus.brown
        super(BrownCorpus,self).__init__(input)
        
        
    def get_texts(self):
        for para in self.corpus.paras():
            result=[]
            for sentence in para:
                result.extend(sentence)
            yield result

Train a TfidfModel on that with smartirs="nnc". Then try to transform a document with it.

@markroxor
Copy link
Contributor

Yes, precompute_idfs should return all ones if the df argument is n.

If that is what is intended than I agree with your concern. I am going to review your PR suggesting the necessary changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix
Projects
None yet
Development

No branches or pull requests

3 participants