TextCorpus doesn't provide a way to convert document text to indices as needed for say DL NLP models #1634
Labels
difficulty easy
Easy issue: required small fix
feature
Issue described a new feature
good first issue
Issue for new contributors (not required gensim understanding + very simple)
Description
TextCorpus doesn't provide a way to convert the text in a document to indices per the dictionary, as needed for, say, Deep Learning NLP models. TextCorpus uses the Dictionary object's `doc2bow` function, which is great for most ML models, but for DL models, where we need sequential indices for the text, it's not usable in most cases.
Steps/Code/Corpus to Reproduce
sample.txt
code:
Expected Results
Some way to simply convert the corpus to indices per the token2id dict object in the Dictionary class, with an added option to provide an unknown token id that replaces all unknown tokens.
So either add a `doc2idx()` in TextCorpus, or integrate it in the Dictionary class alongside `doc2bow()`.
[2, 1, 4, 5, 6]
[0, 0, 7]
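To make the request concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It is not gensim code: the helper names `doc2idx` and `doc2bow`, the `unknown_id` parameter, the toy `token2id` mapping and the choice of 0 as the unknown id are all assumptions made for illustration, and the toy document is unrelated to sample.txt.

```python
from collections import Counter


def doc2idx(tokens, token2id, unknown_id=-1):
    """Map each token to its id, keeping order and repeats.

    Tokens missing from token2id are replaced by unknown_id.
    (Hypothetical helper, not part of gensim as described in this issue.)
    """
    return [token2id.get(tok, unknown_id) for tok in tokens]


def doc2bow(tokens, token2id):
    """Bag-of-words stand-in for contrast: sorted (id, count) pairs.

    Order is lost and unknown tokens are silently dropped.
    """
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


# Toy vocabulary and document, invented for this example.
token2id = {"human": 1, "interface": 2, "computer": 3}
doc = ["interface", "human", "robot", "human"]

idx = doc2idx(doc, token2id, unknown_id=0)
bow = doc2bow(doc, token2id)

print(idx)  # [2, 1, 0, 1]
print(bow)  # [(1, 2), (2, 1)]
```

The sequential form keeps one id per token position, which is what an embedding layer or RNN input would need, while the bag-of-words form collapses the document into counts.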
Actual Results
[(1, 1), (2, 1), (4, 1), (5, 1), (6, 1)]
[(7, 1)]
Versions
Darwin-16.4.0-x86_64-i386-64bit
('Python', '2.7.12 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')