TextCorpus doesn't provide a way to convert document text to indices as needed for say DL NLP models #1634

roopalgarg · 2017-10-18T00:48:09Z

Description

TextCorpus doesn't provide a way to convert the text of a document to indices per the dictionary, as needed for, say, Deep Learning NLP models. TextCorpus uses the Dictionary object's doc2bow function, which is great for most ML models, but for DL models, which need the token indices in their original sequence order, it's not usable in most cases.

Steps/Code/Corpus to Reproduce

sample.txt

hello how are you ?
i am good

code:

from gensim.corpora import Dictionary
from gensim.corpora.textcorpus import TextCorpus

some_file_name = "sample.txt"
some_dictionary = {
    '<UNK>': 0,
    'how': 1,
    'hello': 2,
    'hi': 3,
    'are': 4,
    'you': 5,
    '?': 6,
    'good': 7
}

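# Create an empty gensim Dictionary and set its token -> id mapping by hand
gensim_dictionary = Dictionary()
gensim_dictionary.token2id = some_dictionary

# token_filters=[] turns off the default token filters
txt_corpus = TextCorpus(input=some_file_name, dictionary=gensim_dictionary, token_filters=[])

# Iterating over the corpus yields each document in doc2bow format
for text in txt_corpus:
    print(list(text))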

Expected Results

Some way to simply convert the corpus to indices per the token2id dict in the Dictionary class, with an option to provide an unknown-token id that replaces all unknown tokens.
So either add a doc2idx() to TextCorpus or integrate it into the Dictionary class alongside doc2bow():
[2, 1, 4, 5, 6]
[0, 0, 7]
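
The requested behaviour can be sketched in plain Python, independent of gensim. The helper name doc_to_indices and its unknown_id parameter are hypothetical and used only for illustration; whitespace tokenization is also an assumption based on the sample file.

```python
def doc_to_indices(tokens, token2id, unknown_id=0):
    """Map each token to its id in order; tokens missing from token2id get unknown_id."""
    return [token2id.get(token, unknown_id) for token in tokens]


# Same mapping as in the report above
token2id = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}

# The two lines of sample.txt, split on whitespace (an assumption)
docs = ["hello how are you ?", "i am good"]
index_vectors = [doc_to_indices(doc.split(), token2id) for doc in docs]

for vector in index_vectors:
    print(vector)
# [2, 1, 4, 5, 6]
# [0, 0, 7]
```

Unlike the bag-of-words output, each list keeps one entry per token, in document order, so it can be fed to a sequence model after padding.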

Actual Results

[(1, 1), (2, 1), (4, 1), (5, 1), (6, 1)]
[(7, 1)]

Versions

Darwin-16.4.0-x86_64-i386-64bit
('Python', '2.7.12 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')


menshikh-iv · 2017-10-19T06:54:20Z

Hi @roopalgarg, doc2bow returns data in bag-of-words format (unordered), which works fine for many gensim models. doc2bow also returns the frequency of each word in the document (the second element of each tuple).

Adding a doc2idx method may be a good idea, wdyt @piskvorky @gojomo?
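
To make the bag-of-words point concrete, here is a small sketch that mimics the shape of doc2bow output in plain Python. It is not gensim's implementation; the function name to_bow is hypothetical, and it assumes unknown tokens are simply dropped, which matches the actual results shown in the report.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; order is lost and unknown tokens are dropped."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


token2id = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}

tokens = "you are you".split()

# Bag of words: ids sorted, repeated word collapsed into a count
print(to_bow(tokens, token2id))          # [(4, 1), (5, 2)]

# Ordered indices: one id per token, in the order the tokens appear
print([token2id[t] for t in tokens])     # [5, 4, 5]
```

The second form is what a sequence model needs, since "you are you" and "are you you" have the same bag of words but different index sequences.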

roopalgarg · 2017-10-19T17:15:50Z

Awesome! Excited to work on it... I will wait for confirmation from your end on whether we are actually going to add the doc2idx feature. I am assuming we would add it to the Dictionary class?

roopalgarg · 2017-10-28T07:30:50Z

@menshikh-iv @piskvorky @gojomo any updates on this?
My use case is mainly where we want to convert a document into a series of indices per a word -> word_id mapping, as is needed for Deep Learning based NLP models.
I had a couple of questions about adding a doc2idx feature to the Dictionary class. Since the class itself does a lot of housekeeping with the allow_update parameter in doc2bow, should the same kind of housekeeping be done for doc2idx as well? It might be overkill for this feature if it's used mainly for DL models.
Or should I simply add a feature like get_texts_idx to the TextCorpus class? It would work similarly to __iter__, but instead of calling self.dictionary.doc2bow() it would convert the text to indices using the token2id mapping in the Dictionary class.
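
The two options in this comment can be sketched together as a plain-Python generator. Everything here is hypothetical: the name iter_index_vectors, the allow_update flag (borrowed in spirit from doc2bow), and the rule that new tokens receive the next unused id, which assumes the existing ids are contiguous starting from 0. It does not track document frequencies or any other bookkeeping the Dictionary class does.

```python
def iter_index_vectors(lines, token2id, unknown_id=0, allow_update=False):
    """Yield one list of token ids per line, preserving token order.

    With allow_update=True, unseen tokens are added to token2id (mutating it)
    instead of being mapped to unknown_id.
    """
    for line in lines:
        tokens = line.split()  # whitespace tokenization is an assumption
        if allow_update:
            for token in tokens:
                if token not in token2id:
                    # Assumes ids are contiguous from 0, so len() is the next free id
                    token2id[token] = len(token2id)
        yield [token2id.get(token, unknown_id) for token in tokens]


base_mapping = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}
lines = ["hello how are you ?", "i am good"]

# Fixed vocabulary: unknown tokens collapse to the <UNK> id
fixed = list(iter_index_vectors(lines, dict(base_mapping)))

# Growing vocabulary: unknown tokens get fresh ids in a copy of the mapping
growing_mapping = dict(base_mapping)
growing = list(iter_index_vectors(lines, growing_mapping, allow_update=True))

print(fixed)    # [[2, 1, 4, 5, 6], [0, 0, 7]]
print(growing)  # [[2, 1, 4, 5, 6], [8, 9, 7]]
```

For DL use the fixed-vocabulary path is usually the relevant one, which is why skipping the update housekeeping may be acceptable.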

…iskvorky#1720)
* define doc2idx to convert a document to a vector of indexes per the dictionary
* update documentation
* changes to textcorpus to add a mode for index vector format output. adding test case for the changes
* fixing doc string
* fix doc string
* fix doc string
* removing trailing white spaces
* removing trailing white spaces
* changes as per review
* change as per review. reverting changes to TextCorpus as discussed

menshikh-iv added the "difficulty easy", "feature", and "good first issue" labels Oct 19, 2017

roopalgarg mentioned this issue Nov 16, 2017

Add functionality in TextCorpus to convert document text to index vectors #1720

Merged

menshikh-iv closed this as completed in db3b881 Nov 20, 2017
