Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality in TextCorpus to convert document text to index vectors #1720

Merged
merged 10 commits into from
Nov 20, 2017

Conversation

roopalgarg
Copy link
Contributor

TextCorpus doesn't provide a way to convert document text to index vector as needed for say DL NLP models.
Adding a 'doc2idx' to Dictionary object and creating modes in TextCorpus to leverage this functionality.
Referencing issue #1634

@roopalgarg
Copy link
Contributor Author

@menshikh-iv how does this look?

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this change for Dictionary, it's OK, but I didn't understand, why this changes needed for TextCorpus ( the main question is why only for him)

@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
else:
return result

def doc2idx(self, document, unk_wrd_idx=0):
"""
Convert `document` (a list of words) into a list of indexes = list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings.

if isinstance(document, string_types):
raise TypeError("doc2idx expects an array of unicode tokens on input, not a single string")

token2id = self.token2id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use self.token2id directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was just following convention from 'doc2bow' but I will make the change as you pointed out.

@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
else:
return result

def doc2idx(self, document, unk_wrd_idx=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary always start numbering from 0, for this reason, index 0 always busy with some word, -1 is significantly better as the default value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please rename unk_wrd_idx to unknown_word_index (here and everywhere)


token2id = self.token2id

list_word_idx = list()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document = [word if isinstance(word, unicode) else unicode(word, 'utf-8') for word in document]
return [self.token2id.get(word, unknown_word_index) for word in document]

@@ -112,7 +114,10 @@ class TextCorpus(interfaces.CorpusABC):
6. remove stopwords; see `gensim.parsing.preprocessing` for the list of stopwords

"""
def __init__(self, input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None):
def __init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use vertical indent (only for method/function definition, in all other cases - hanging indent).

@roopalgarg
Copy link
Contributor Author

@menshikh-iv the idea was that since we are adding the functionality to the __iter__ of TextCorpus, we can process a corpus to get vocabulary indexes of the corpus. Also since WikiCorpus inherits TextCorpus and doesnt override the __iter__ the functionality gets added to WikiCorpus as well.
Though I agree that the constructor for WikiCorpus doesnt explicitly take parameters for this new functionality, but we can set the parameters after creating an instance of the class by explicitly setting the new parameters.

So couple of things here then:

  1. Should WikiCorpus have the parameters explicitly as well as part of its constructor definition?
  2. Where else do you think this functionality should be added since you were asking as to why only TextCorpus?

@menshikh-iv
Copy link
Contributor

@roopalgarg current problem is more "global", let me describe:

  • We have many different classes for corpuses, we need to add it for all corpuses (but I don't think that this is a good idea, but have same interfaces will be nice).
  • How to choose the concrete class for support this Dictionary feature (maybe we no need to add it to any corpuses, only new method for Dictionary?)
  • Passing arguments (if we want to add it in many places)

I think corpus classes needs global refactoring (bring everything to the same interfaces and simplify, i.e. a minimum of functionality), @roopalgarg it isn't your problem, sorry, but you reminded me about the old and important problem.

@roopalgarg I'm ready to merge only new method for Dictionary right now.

@piskvorky wdyt? corpuses in common are very chaotic, have any idea how to rework it?

@roopalgarg
Copy link
Contributor Author

@menshikh-iv I see your point. For now just adding doc2idx to the Dictionary class sounds good to me.
Do I need to revert my changes to the other 2 files (or there is a way you can discard the changes to them)?


Notes
-----
This function is `const`, aka read-only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No needed indentation here


Parameters
----------
document : list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of str

Parameters
----------
document : list
List of words tokenized, normalized and preprocessed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention type twice


Returns
-------
list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of int

Returns
-------
list
List of indexes in the dictionary for words in the `document`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention type twice + add preserves order.

-------
list
List of indexes in the dictionary for words in the `document`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add example section (simple example that works how to apply this method)

@menshikh-iv
Copy link
Contributor

@roopalgarg yeah, please revert 2 files, fix docstring and that's all 👍

reverting changes to TextCorpus as discussed
@roopalgarg
Copy link
Contributor Author

@menshikh-iv a little new to numpy style docstrings so not fully aware of best practices. learnt something new today :)

@roopalgarg
Copy link
Contributor Author

@menshikh-iv good to merge ?

@menshikh-iv
Copy link
Contributor

@roopalgarg yeah, thanks for your contribution:+1:

@menshikh-iv menshikh-iv merged commit db3b881 into piskvorky:develop Nov 20, 2017
@roopalgarg
Copy link
Contributor Author

@menshikh-iv awesome! thanks

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…iskvorky#1720)

* define doc2idx to convert a document to a vector of indexes per the dictionary

* update documentation

* changes to textcorpus to add a mode for index vector format output. adding test case for the changes

* fixing doc string

* fix doc string

* fix doc string

* removing trailing white spaces

* removing trailing white spaces

* changes as per review

* change as per review.

reverting changes to TextCorpus as discussed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants