Fix implementation of smartirs Document Frequency n #2020

PeteBleackley · 2018-04-06T10:38:45Z

Description

tfidfmodel.updated_wglobal(docfreq, totaldoc, 'n') returns utils.identity(docfreq) instead of 1.
This means that all term frequencies will be multiplied by the document frequency. For a large corpus this is particularly bad and will cause normalisation to crash.
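
To make the effect concrete, here is a minimal pure-Python sketch of the two behaviours for the "n" document-frequency option. It does not call gensim; the function names and the numbers are hypothetical, chosen only to illustrate the scaling described above.

```python
def df_weight_n_as_reported(docfreq, totaldocs):
    """Behaviour reported in this issue: 'n' passes docfreq through unchanged."""
    return docfreq


def df_weight_n_per_smart(docfreq, totaldocs):
    """SMART 'n' (no document-frequency weighting): a constant factor of 1."""
    return 1.0


def weight(tf, docfreq, totaldocs, df_weight):
    """Combine an unmodified term frequency with a document-frequency weight."""
    return tf * df_weight(docfreq, totaldocs)


# Hypothetical numbers: a term that occurs 3 times in one document
# and appears in 40,000 of 50,000 documents.
tf, docfreq, totaldocs = 3, 40000, 50000
reported = weight(tf, docfreq, totaldocs, df_weight_n_as_reported)
expected = weight(tf, docfreq, totaldocs, df_weight_n_per_smart)
print(reported, expected)
```

With the pass-through behaviour the weight grows with the document frequency, so the most common terms get the largest weights, which is the opposite of what a DF component is normally for.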

markroxor · 2018-04-07T09:25:25Z

Please take another look at the model. wglobal is actually used here:

return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}

wglobal returns the normalized doc_freq; it is not a normalizing factor. Therefore, when tfidfmodel.updated_wglobal(docfreq, totaldoc, 'n') is used, it should return docfreq.

markroxor · 2018-04-07T09:25:37Z

@menshikh-iv please close this issue.

PeteBleackley · 2018-04-07T17:34:29Z

@markroxor According to https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System , df="n" should return 1. Returning docfreq causes serious problems with a large corpus (*nc crashes due to a numerical overflow), and it was because of this that I raised the issue and the PR.

markroxor · 2018-04-08T07:04:32Z

I have only borrowed the terminology, not the exact approach. Please go through the code step by step and you will understand.

PeteBleackley · 2018-04-08T09:01:27Z

@markroxor I've just spent a week working out why my code kept crashing, and it turned out to be because the implementation of SMART doesn't match the published specification. I think you wanted "n" to mean "Do not modify the argument" in all positions, but that doesn't make sense for document frequencies, especially as you generally want smaller weights for larger DFs. It would make more sense to think of it as "Do not modify TF".

markroxor · 2018-04-08T13:38:55Z

> I think you wanted "n" to mean "Do not modify the argument" in all positions,

You are right but can you see the code flow and where exactly the function updated_wglobal is used?
It is used here.

With your modification, updated_wglobal will return 1, and therefore precompute_idfs will return a dictionary in which every value is 1. Please go through the code flow.

> I've just spent a week working out why my code kept crashing, and it turned out to be because the implementation of SMART doesn't match the published specification.

The problem may be in your code rather than in the tf-idf implementation. Feel free to share your code on the gensim Gitter channel.

PeteBleackley · 2018-04-08T14:11:21Z

@markroxor Yes, precompute_idfs should return all ones if the df argument is n.
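
As a rough illustration of that point, the sketch below mirrors the dictionary comprehension quoted earlier in the thread in plain Python. It is not gensim's code: `precompute_idfs_sketch`, `wglobal_n`, and the sample document frequencies are hypothetical stand-ins.

```python
def precompute_idfs_sketch(wglobal, dfs, total_docs):
    """Apply a global weighting function to every term's document frequency."""
    return {termid: wglobal(df, total_docs) for termid, df in dfs.items()}


def wglobal_n(docfreq, total_docs):
    """SMART 'n' for the DF position: every term gets the same factor of 1."""
    return 1.0


# Hypothetical document frequencies for three term ids in a 15,000-document corpus.
dfs = {0: 1, 1: 250, 2: 15000}
idfs = precompute_idfs_sketch(wglobal_n, dfs, total_docs=15000)
print(idfs)
```

Every entry is 1.0 regardless of how rare or common the term is, so multiplying by these values leaves the term frequencies untouched.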

To reproduce the bug I encountered, use the following corpus class.

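import gensim
import nltk.corpus


class BrownCorpus(gensim.corpora.TextCorpus):
    """Stream the NLTK Brown corpus, one paragraph per document."""

    def __init__(self, input):
        # Set the corpus before calling the parent constructor, which may
        # call get_texts() to build the dictionary.
        self.corpus = nltk.corpus.brown
        super(BrownCorpus, self).__init__(input)

    def get_texts(self):
        for para in self.corpus.paras():
            # Flatten the paragraph's sentences into a single token list.
            result = []
            for sentence in para:
                result.extend(sentence)
            yield result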

Train a TfidfModel on that with smartirs="nnc". Then try to transform a document with it.

markroxor · 2018-04-08T14:39:25Z

> Yes, precompute_idfs should return all ones if the df argument is n.

If that is what is intended, then I agree with your concern. I am going to review your PR and suggest the necessary changes.

PeteBleackley mentioned this issue Apr 6, 2018

Fix SMART from TfidfModel for case when df == "n". Fix #2020 #2021

Merged

menshikh-iv added the "bug" and "difficulty easy" labels Apr 6, 2018

menshikh-iv closed this as completed in 06f5f5c Apr 10, 2018
