Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding type check for corpus_file argument #2469

Merged
merged 16 commits into from
May 5, 2019
5 changes: 5 additions & 0 deletions gensim/models/doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -794,6 +794,11 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor

"""
kwargs = {}

# Check the type of corpus_file
if not isinstance(corpus_file, string_types):
Copy link
Collaborator

@mpenkov mpenkov Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corpus_file may legitimately be None, when documents is not None. This is the reason why some of the unit tests fail. Please have a look at Travis CI.

So, if you go ahead with your proposed check, you need something like:

Suggested change
if not isinstance(corpus_file, string_types):
if corpus_file is not None and not isinstance(corpus_file, string_types):

Also, please add a unit test that stresses your new functionality (pass in a non-string corpus_file, expect a TypeError raised).

Next, what is the benefit of raising TypeError here? What happens with the existing code if we do not raise TypeError? What kind of exception does the user see?

I think if we go ahead with this kind of parameter checking, we should do it properly:

  1. Ensure that one of documents or corpus_file is None (they cannot both be non-None)
  2. Ensure that one of documents or corpus_file is not None (they cannot both be None)
  3. If documents is not None, then it must be an iterable
  4. If corpus_file is not None, then it must be a string

Finally, if we go ahead with this, we should apply this consistently everywhere, not just doc2vec (e.g. fasttext has similar issues).

My question is: is it worth it? @piskvorky

Copy link
Owner

@piskvorky piskvorky Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the tests. I edited the PR description for context (link to the original issue).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky @mpenkov Thanks for the detailed feedback. Working on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd say: while checking whether it's a string may help a bit, if there's checking, it'd make sense to ensure things like illegal or missing paths also generate meaningful error messages. It might be possible to just leverage some existing path-checking/file-existence-checking method, or wrap the failing method(s) in a handler that catches errors and shows a good-enough error message like "Problem with corpus_file value X".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo so should I check corpus_file for being a valid file path instead of checking for string_type ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, what covers the most cases with the most straightforward code? A path-test, with the right catches/error-message, might "catch N birds with one stone", pointing the user in the right direction to understand their parameter error with fewer conditionals/lines-of-tests. Or it might complicate things compared to the simple proper-type test. But the motivating case for this change - that users following slightly-older examples would get a confusing error message, due to a new corpus_file parameter` – will actually already have been handled by the simple "one or the other but not both" test. So, even the string test isn't strictly required to address the motivating case. My pref would be: do the simplest thing that resolves the motivating case, for sure. If there's a easy/clear bit of extra checking that makes also makes sense, consider it as well.

raise TypeError("Parameter corpus_file of train() must be a string (path to a file).")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also show what the received parameter is instead?

Having concrete error messages helps avoid confusion.


if corpus_file is not None:
# Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file)
offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers)
Expand Down