Add functionality in TextCorpus to convert document text to index vectors #1720
…dding test case for the changes
@menshikh-iv how does this look?
I understand the change to Dictionary, and it's OK, but I don't understand why these changes are needed for TextCorpus (the main question is why only for that class).
gensim/corpora/dictionary.py
Outdated
@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
        else:
            return result

    def doc2idx(self, document, unk_wrd_idx=0):
        """
        Convert `document` (a list of words) into a list of indexes = list
Please use numpy-style docstrings.
gensim/corpora/dictionary.py
Outdated
        if isinstance(document, string_types):
            raise TypeError("doc2idx expects an array of unicode tokens on input, not a single string")

        token2id = self.token2id
You can use self.token2id directly.
I was just following the convention from 'doc2bow', but I will make the change as you pointed out.
gensim/corpora/dictionary.py
Outdated
@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
        else:
            return result

    def doc2idx(self, document, unk_wrd_idx=0):
Dictionary always starts numbering from 0, so index 0 is always occupied by some word. For this reason, -1 is a significantly better default value.
Also, please rename unk_wrd_idx to unknown_word_index (here and everywhere).
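To illustrate the point about the default value, here is a minimal sketch. It assumes a plain dict standing in for the dictionary's token2id mapping; the tokens and ids are made up for illustration and are not taken from the PR.

```python
# Hypothetical mapping standing in for Dictionary.token2id; ids start at 0.
token2id = {"human": 0, "computer": 1, "interface": 2}

document = ["human", "robot", "interface"]  # "robot" is not in the mapping

# With 0 as the fallback, the unknown word collides with the id of "human".
with_zero = [token2id.get(word, 0) for word in document]

# With -1 as the fallback, unknown words stay distinguishable.
with_minus_one = [token2id.get(word, -1) for word in document]

print(with_zero)       # [0, 0, 2]
print(with_minus_one)  # [0, -1, 2]
```

With 0 as the fallback, a downstream consumer cannot tell "human" from an out-of-vocabulary word.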
gensim/corpora/dictionary.py
Outdated
        token2id = self.token2id

        list_word_idx = list()
document = [word if isinstance(word, unicode) else unicode(word, 'utf-8') for word in document]
return [self.token2id.get(word, unknown_word_index) for word in document]
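The suggestion above uses the Python 2 `unicode` type. As a rough Python 3 sketch of the same idea, the lookup could be written as a free function; the `token2id` argument and the sample ids are assumptions for illustration, not the merged gensim code.

```python
def doc2idx(token2id, document, unknown_word_index=-1):
    """Sketch of the suggested one-pass lookup, written as a free function."""
    # A bare string would be iterated character by character, so reject it.
    if isinstance(document, str):
        raise TypeError("doc2idx expects an array of unicode tokens on input, not a single string")
    # Decode byte tokens so they match the str keys of the mapping.
    document = [word if isinstance(word, str) else str(word, "utf-8") for word in document]
    # Look each word up, falling back to the unknown-word index.
    return [token2id.get(word, unknown_word_index) for word in document]


token2id = {"graph": 0, "minors": 1, "trees": 2}  # made-up ids for illustration
print(doc2idx(token2id, ["trees", b"graph", "survey"]))  # [2, 0, -1]
```

The list comprehension replaces the explicit accumulator list from the original diff and keeps the output in the same order as the input tokens.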
gensim/corpora/textcorpus.py
Outdated
@@ -112,7 +114,10 @@ class TextCorpus(interfaces.CorpusABC):
    6. remove stopwords; see `gensim.parsing.preprocessing` for the list of stopwords

    """
    def __init__(self, input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None):
    def __init__(
Please use a vertical indent (only for method/function definitions; in all other cases, use a hanging indent).
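A small sketch of the two styles as this comment seems to intend them. The function `describe_corpus` is hypothetical and only reuses the parameter names from the diff.

```python
# Vertical indent: continuation lines align with the first parameter.
# Per the comment above, this form is meant for def statements only.
def describe_corpus(input=None, dictionary=None, metadata=False,
                    character_filters=None, tokenizer=None, token_filters=None):
    # Hypothetical stand-in; it only reports which arguments were supplied.
    supplied = {
        "input": input,
        "dictionary": dictionary,
        "metadata": metadata,
        "character_filters": character_filters,
        "tokenizer": tokenizer,
        "token_filters": token_filters,
    }
    return sorted(
        name for name, value in supplied.items()
        if value is not None and value is not False
    )


# Hanging indent: nothing after the opening bracket, arguments indented one level.
names = describe_corpus(
    input="docs.txt",
    metadata=True,
    tokenizer=str.split,
)
print(names)  # ['input', 'metadata', 'tokenizer']
```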
@menshikh-iv the idea was that since we are adding the functionality to the So couple of things here then:
@roopalgarg current problem is more "global", let me describe:
I think the corpus classes need a global refactoring (bring everything to the same interface and simplify, i.e. a minimum of functionality). @roopalgarg this isn't your problem, sorry, but you reminded me of an old and important one. @roopalgarg I'm ready to merge only the new method for Dictionary right now. @piskvorky wdyt? The corpus classes in general are very chaotic; do you have any idea how to rework them?
@menshikh-iv I see your point. For now just adding
gensim/corpora/dictionary.py
Outdated

        Notes
        -----
        This function is `const`, aka read-only
No indentation is needed here.
gensim/corpora/dictionary.py
Outdated

        Parameters
        ----------
        document : list
list of str
gensim/corpora/dictionary.py
Outdated
        Parameters
        ----------
        document : list
            List of words tokenized, normalized and preprocessed.
No need to mention the type twice.
gensim/corpora/dictionary.py
Outdated

        Returns
        -------
        list
list of int
gensim/corpora/dictionary.py
Outdated
        Returns
        -------
        list
            List of indexes in the dictionary for words in the `document`
No need to mention the type twice. Also add "preserves order".
gensim/corpora/dictionary.py
Outdated
        -------
        list
            List of indexes in the dictionary for words in the `document`
Please add an Examples section (a simple working example of how to apply this method).
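Putting the docstring comments in this review together, a numpy-style docstring with an Examples section might look roughly like the sketch below. `TinyDictionary` is a hypothetical stand-in class used only to keep the example self-contained; the wording is illustrative and not the text that was merged.

```python
class TinyDictionary:
    """Hypothetical stand-in for a token-to-id dictionary, for illustration only."""

    def __init__(self, token2id):
        self.token2id = token2id

    def doc2idx(self, document, unknown_word_index=-1):
        """Convert `document` into a list of indexes, one per word.

        Parameters
        ----------
        document : list of str
            Tokenized, normalized and preprocessed words.
        unknown_word_index : int, optional
            Index to use for words that are not in the dictionary.

        Returns
        -------
        list of int
            Indexes in the dictionary for words in the `document`, preserving order.

        Notes
        -----
        This function is `const`, aka read-only.

        Examples
        --------
        >>> dct = TinyDictionary({"a": 0, "b": 1})
        >>> dct.doc2idx(["b", "c", "a"])
        [1, -1, 0]

        """
        return [self.token2id.get(word, unknown_word_index) for word in document]
```

Each type appears once (in the `name : type` line), the return description mentions order, and the Examples section is written in doctest form.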
@roopalgarg yeah, please revert 2 files, fix the docstring and that's all 👍
reverting changes to TextCorpus as discussed
@menshikh-iv I'm a little new to numpy-style docstrings, so not fully aware of best practices. Learnt something new today :)
@menshikh-iv good to merge?
@roopalgarg yeah, thanks for your contribution 👍
@menshikh-iv awesome! thanks
…iskvorky#1720)
* define doc2idx to convert a document to a vector of indexes per the dictionary
* update documentation
* changes to textcorpus to add a mode for index vector format output. adding test case for the changes
* fixing doc string
* fix doc string
* fix doc string
* removing trailing white spaces
* removing trailing white spaces
* changes as per review
* change as per review. reverting changes to TextCorpus as discussed
TextCorpus doesn't provide a way to convert document text to an index vector, as needed for, say, deep learning NLP models.
Adding a 'doc2idx' method to the Dictionary object and creating modes in TextCorpus to leverage this functionality.
References issue #1634.