Adding type check for corpus_file argument #2469
Merged
16 commits
40989ba saraswatmks: adding type check for corpus_file argument
ad76b83 saraswatmks: fixes to handle different typeerror in train parameters, adding unitt…
e1f32a2 saraswatmks: adding doc2vec with more typeerror checks
a9eeddf saraswatmks: fixing lint errors
6639089 saraswatmks: removing f-string use
2ca51ca saraswatmks: fixes as suggested
6458aea saraswatmks: remove unused imports
bc9dce6 saraswatmks: using xor as suggested
97d5619 saraswatmks: minor fixes
cdaff9f saraswatmks: only check for iterable
b046292 saraswatmks: minor fix - 2
298e8f0 saraswatmks: checking corpus_file path, removing xor
4fb01f4 saraswatmks: fixing nitpiks
07a7d5d saraswatmks: parameters check in fasttext module
8821b92 saraswatmks: extra space fix
bfc4360 saraswatmks: remove comments
@@ -794,6 +794,11 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor
         """
         kwargs = {}
+
+        # Check the type of corpus_file
+        if not isinstance(corpus_file, string_types):
+            raise TypeError("Parameter corpus_file of train() must be a string (path to a file).")
+
         if corpus_file is not None:
             # Calculate offsets for each worker along with initial doctags (doctag ~ document/line number in a file)
             offsets, start_doctags = self._get_offsets_and_start_doctags_for_corpusfile(corpus_file, self.workers)

Review comment on the `raise TypeError` line: Can you also show what the received parameter is instead? Having concrete error messages helps avoid confusion.
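One way to act on the request for a more concrete message is to include the received value and its type in the error text. The sketch below is illustrative only: `check_corpus_file_type` is a hypothetical helper, not a gensim function, and it uses the built-in `str` in place of `string_types` so that it runs without extra dependencies. It also assumes that `None` means "not provided" and should pass through.

```python
def check_corpus_file_type(corpus_file):
    """Hypothetical helper: reject a non-string corpus_file and say what was received."""
    # Assumption: None means the caller did not pass corpus_file at all.
    if corpus_file is not None and not isinstance(corpus_file, str):
        raise TypeError(
            "Parameter corpus_file of train() must be a string (path to a file), "
            "got %r (type %s) instead." % (corpus_file, type(corpus_file).__name__)
        )


if __name__ == "__main__":
    try:
        check_corpus_file_type(123)
    except TypeError as err:
        print(err)
```

The `%`-style formatting is deliberate: the commit list shows that f-strings were removed from this change.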
|
`corpus_file` may legitimately be None when `documents` is not None. This is why some of the unit tests fail; please have a look at Travis CI.
So, if you go ahead with your proposed check, it needs to let a None `corpus_file` through.
Also, please add a unit test that stresses your new functionality (pass in a non-string `corpus_file`, expect a `TypeError` to be raised).
Next, what is the benefit of raising TypeError here? What happens with the existing code if we do not raise TypeError? What kind of exception does the user see?
I think if we go ahead with this kind of parameter checking, we should do it properly:
- `documents` or `corpus_file` is None (they cannot both be non-None)
- `documents` or `corpus_file` is not None (they cannot both be None)
- If `documents` is not None, then it must be an iterable
- If `corpus_file` is not None, then it must be a string
Finally, if we go ahead with this, we should apply it consistently everywhere, not just in doc2vec (e.g. fasttext has similar issues).
My question is: is it worth it? @piskvorky
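The four conditions listed in the comment above could be sketched roughly as follows. This is not the code that was merged: `validate_train_inputs` is a hypothetical name, the messages are made up for illustration, and `str` stands in for `string_types`. A small unit test of the kind requested (non-string `corpus_file`, expect `TypeError`) is included.

```python
import unittest
from collections.abc import Iterable


def validate_train_inputs(documents=None, corpus_file=None):
    """Hypothetical helper applying the four checks from the review comment."""
    # 1. They cannot both be non-None.
    if documents is not None and corpus_file is not None:
        raise TypeError("Both documents and corpus_file were given; pass only one of them.")
    # 2. They cannot both be None.
    if documents is None and corpus_file is None:
        raise TypeError("Either documents or corpus_file must be given.")
    # 3. If documents is given, it must be an iterable.
    if documents is not None and not isinstance(documents, Iterable):
        raise TypeError("documents must be an iterable, got %r instead." % (documents,))
    # 4. If corpus_file is given, it must be a string.
    if corpus_file is not None and not isinstance(corpus_file, str):
        raise TypeError("corpus_file must be a string, got %r instead." % (corpus_file,))


class TestValidateTrainInputs(unittest.TestCase):
    def test_non_string_corpus_file_raises(self):
        with self.assertRaises(TypeError):
            validate_train_inputs(corpus_file=42)

    def test_documents_only_is_accepted(self):
        validate_train_inputs(documents=[["some", "words"]])
```

Keeping the checks in one helper is one way to apply them consistently across modules, which is the concern raised about fasttext.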
Agreed on the tests. I edited the PR description for context (link to the original issue).
@piskvorky @mpenkov Thanks for the detailed feedback. Working on it.
Also I'd say: while checking whether it's a string may help a bit, if there's checking, it'd make sense to ensure things like illegal or missing paths also generate meaningful error messages. It might be possible to just leverage some existing path-checking/file-existence-checking method, or wrap the failing method(s) in a handler that catches errors and shows a good-enough error message like "Problem with `corpus_file` value X".
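A minimal sketch of the path-checking idea, using the standard library's `os.path.isfile` as the "existing file-existence-checking method". The helper name `check_corpus_file_path` and the choice of `TypeError` versus `ValueError` are assumptions made for this example, not something the thread settles.

```python
import os


def check_corpus_file_path(corpus_file):
    """Hypothetical helper: report a bad corpus_file value in one readable message."""
    # Check the type first; os.path.isfile may itself fail on non-path objects.
    if not isinstance(corpus_file, str):
        raise TypeError(
            "Problem with corpus_file value %r: expected a string path." % (corpus_file,)
        )
    # A string that does not point at an existing regular file is also rejected.
    if not os.path.isfile(corpus_file):
        raise ValueError(
            "Problem with corpus_file value %r: no such file." % (corpus_file,)
        )
```

Both branches share the "Problem with corpus_file value X" wording, so the user sees the offending value whether the mistake is a wrong type or a wrong path.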
@gojomo so should I check `corpus_file` for being a valid file path instead of checking for `string_type`?
I'm not sure. What covers the most cases with the most straightforward code? A path test, with the right catches and error message, might "catch N birds with one stone", pointing the user in the right direction to understand their parameter error with fewer conditionals and fewer lines of tests. Or it might complicate things compared to the simple proper-type test. But the motivating case for this change (users following slightly older examples would get a confusing error message because of the new `corpus_file` parameter) will already have been handled by the simple "one or the other but not both" test. So even the string test isn't strictly required to address the motivating case. My preference would be: do the simplest thing that resolves the motivating case, for sure. If there's an easy, clear bit of extra checking that also makes sense, consider it as well.
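To make the "one or the other but not both" point concrete, here is a small sketch. The `train` function below is a hypothetical stand-in, not gensim's method, and it assumes that the older examples passed their arguments positionally, so that the second argument now lands in `corpus_file`. The thread does not spell that detail out.

```python
def train(documents=None, corpus_file=None, total_examples=None, epochs=None):
    """Hypothetical stand-in for a train() whose second parameter is now corpus_file."""
    # Exactly one of the two inputs must be provided.
    if (documents is None) == (corpus_file is None):
        raise TypeError("You must provide exactly one of documents or corpus_file.")
    return "ok"


if __name__ == "__main__":
    docs = [["some", "words"], ["more", "words"]]
    # Keyword call: only documents is set, so the check passes.
    print(train(documents=docs, total_examples=len(docs), epochs=5))
    # Old-style positional call: len(docs) is bound to corpus_file,
    # so both inputs are non-None and the check fires.
    try:
        train(docs, len(docs), 5)
    except TypeError as err:
        print(err)
```

Under that assumption the single comparison already produces a clear error for the old call style, with no type or path test involved.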