-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added function to replace gensim check_output with subprocess.check_output #1182
Conversation
@kirit93 Thanks for the PR. Could you please replace the actual There was some reason I added extra |
@tmylk I ran the tests for mallet, dtm, fasttext and wordrank wrappers. All pass except for fasttext, but the error I get with my
I'll get started with the rest of the things you mentioned. |
Looks like a bad test -- floats should never be compared for bit equality. Instead, there's |
Should I fix those tests?
I'll change it to :
I made the change and ran the test and there are no failures, the assertEqual is used in a couple of other places in this test. If this fix is okay, I'll make the changes everywhere. |
@kirit93 Confirm that using Does the error output get printed to the log with this new change? |
@tmylk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address comments about exception handling
gensim/utils.py
Outdated
raise | ||
except: | ||
error = "Error in check_output while trying to execute: \n ' " + str(args) + " '\nthis command was not found" | ||
print(error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please output to the log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular log file I should be logging output to? If not STDOUT, what is the right way to send out the error message? In the earlier function the output and error were being returned from the function. Should I do the same? I could output it to STDERR as the error generated by the shell when it tries executing the incorrect command is also written to STDERR.
gensim/utils.py
Outdated
process.terminate() | ||
raise | ||
except: | ||
error = "Error in check_output while trying to execute: \n ' " + str(args) + " '\nthis command was not found" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could there be exceptions even when command is found? if yes, then please remove "command was not found"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I overlooked that detail. Should I change the error message to something more generic like " xyz command could not be executed "?
gensim/utils.py
Outdated
except KeyboardInterrupt: | ||
process.terminate() | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of catching and immediately throwing the exception? this line doesn't add anything as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sure the catch-all except:
below doesn't apply to KeyboardInterrupt
.
But it's not a good practice anyway, because there's also SystemExit
etc. At least limit the catch-all to Exception
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be a good idea to catch the subprocess.CalledProcessError
and output an error message that is easy for the user to understand? This is the error raised when check_output
fails.
For any other error including KeyboardInterrupt
I can use a catch-all Exception
and raise it.
gensim/utils.py
Outdated
except KeyboardInterrupt: | ||
process.terminate() | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sure the catch-all except:
below doesn't apply to KeyboardInterrupt
.
But it's not a good practice anyway, because there's also SystemExit
etc. At least limit the catch-all to Exception
.
gensim/utils.py
Outdated
@@ -1176,4 +1180,4 @@ def sample_dict(d, n=10, use_random=True): | |||
according to natural dict iteration. | |||
""" | |||
selected_keys = random.sample(list(d), min(len(d), n)) if use_random else itertools.islice(iterkeys(d), n) | |||
return [(key, d[key]) for key in selected_keys] | |||
return [(key, d[key]) for key in selected_keys] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unwanted change.
gensim/utils.py
Outdated
raise | ||
except: | ||
error = "Error in check_output while trying to execute: \n ' " + str(args) + " '\nthis command was not found" | ||
print(error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original exception was more useful, because it included stdout/stderr, which useful for error debugging.
Is there any way to do the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I didn't fully understand what you mean here. Which original exception are you talking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant subprocess.CalledProcessError
, a few lines above.
I don't have any strong opinion on what this function should do on failure, or how to capture the subprocess stdout/stderr. Except for sure we don't want to be printing anything -- logging preferred.
gensim/utils.py
Outdated
In case args generates an error | ||
>>> test_checkoutput(args=['/usr/bin/pythons', '-ve']) #Incorrect argument | ||
/bin/sh: /usr/bin/pythons: No such file or directory | ||
*Error in args : /usr/bin/pythons -ve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that is printed? Or logged?
In general, we don't want to pollute stdout at all, unless the user explicitly requested printing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The No such file or directory
line gets outputted to STDERR as it is the error returned by the shell when it tries to execute the command in args
.
The Error in args: /usr/bin/pythons -ve
message gets printed to STDOUT because I'm using a simple print
command. Is there a log file you'd like me to log the output to or should I print to STDERR if we don't want STDOUT polluted?
gensim/utils.py
Outdated
Instead of raising the error, output a more specific error message | ||
""" | ||
error = "subprocess.check_output could not execute command ' " + str(args) + " '" | ||
print(error, file=sys.stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log instead of printing, see examples in the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please output to log instead of printing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for the exception handling
gensim/utils.py
Outdated
error = "subprocess.check_output could not execute command ' " + str(args) + " '" | ||
print(error, file=sys.stderr) | ||
return error | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is redundant. It will be raised anyway
gensim/test/test_utils.py
Outdated
|
||
class TestOutput(unittest.TestCase): | ||
def test_check_output(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test that exception is raised
gensim/test/test_utils.py
Outdated
@@ -97,6 +97,10 @@ def test_check_output(self): | |||
res = utils.check_output(args=["echo", "hello"]) | |||
self.assertEqual(res, b'hello\n') | |||
|
|||
def test_check_output_exception(self): | |||
error = utils.check_output(args=['ldfs']) | |||
self.assertEqual(error, "subprocess.check_output could not execute command ' ldfs '") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_utils.py
imports utils.py
from the installed gensim
. So this test case is still running the old version of check_output
and therefore the exception test fails. However, if you change the import to use the current version of utils.check_output
, all tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can install your new version of gensim with python setup.py install
Please add unit test that exception is indeed raised. |
… replaced deprecated assertEquals with assertEqual
gensim/utils.py
Outdated
""" | ||
error = "subprocess.check_output could not execute command ' " + str(args) + " '" | ||
logger.error(error) | ||
return error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reason to return
instead of previous raise
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I've chosen to simple return
the error message is because raise
results in the following large and not user friendly message.
Traceback (most recent call last):
File "test.py", line 51, in test_check_out
res = subprocess.check_output(args, shell=flag)
File "/Users/kirit/anaconda/lib/python3.5/subprocess.py", line 626, in check_output
**kwargs).stdout
File "/Users/kirit/anaconda/lib/python3.5/subprocess.py", line 708, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ldfs' returned non-zero exit status 127
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 76, in <module>
test_check_out(args=["ldfs"])
File "test.py", line 60, in test_check_out
raise error
TypeError: exceptions must derive from BaseException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raising a string (instead of some exception) is definitely not a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we benefit from raising an exception here? Once the subprocess.CalledProcessError
occurs and is caught, an appropriate message is returned to the user indicating why the check_output
command failed. Wouldn't that be enough for the user to proceed correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your raise error
already raises an exception. It's just that proper exceptions inherit from Exception
(or appropriate subclasses), rather than being just strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do a raise Exception(error)
.
Would that be okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked into the actual logic. Just commenting on the discussion above, where you say The reason I've chosen to simple return the error message is because raise results in the following large and not user friendly message.
, but that message is only due to raising a string instead of a proper exception :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2c would be to use the same error-handling logic that the function employed previously, before your changes. The least surprise to users.
@tmylk @piskvorky I've modified the function to handle errors just as the earlier function would. I hope this is okay. Please let me know if there are any further changes to be made so that the PR can be merged. Thanks! |
gensim/utils.py
Outdated
Instead of raising the error, output a more specific error message | ||
""" | ||
error = "subprocess.check_output could not execute command ' " + str(args) + " '" | ||
logger.error(error) | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please raise the CalledProcessError exception as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing something, but why exactly do we need to raise the CalledProcessError
at all. As long as we tell the user why check_output
failed, shouldn't that be fine? Using shell=true
with subprocess.check_output
, in case of erroneous input, the same error message gets displayed as it would if the user were to enter the command on the terminal. This serves the purpose of telling the user why check_output
failed.
If I pass ldfs
to check_output
, stderr will have /bin/sh: ldfs: command not found
. Similarly, if you were to type the same command on the terminal, the output would be bash: ldfs: command not found
.
In the earlier code stderr would have had a huge error message indicating a FileNotFound
. If required CalledProcessError
can be captured and a similar error message can be logged.
In case I've missed something and there is some reason that an implementation similar to the earlier one is preferred, I will look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all we need the actual output to be written to log, not to stderr.
Secondly, the tests should check that the return message is indeed for ldfs: command not found
. The current message they test for is not informative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be okay for the log to have the error message as follows
/bin/sh: ldfs: command not found
Command '['ldfs']' returned non-zero exit status 127.
along with a subprocess.CalledProcessError
being raised?
In the test case, I'm unable to figure out how to use assertRaises
to verify that it was in fact a subprocess.CalledProcessError
that was raised.
It seems that tests didn't run for the last commit, Is it correct that the only difference between old code and this improvement is that the error is printed to log here:
However this doesn't answer the original purpose of the issue in #703 :
Please add code and tests that actually do that. |
@tmylk, could you check out my latest push and let me know if that works for us? |
@menshikh-iv, could you let me know if this PR is okay to merge? |
gensim/utils.py
Outdated
Python 2.6.2 | ||
Added extra KeyboardInterrupt handling | ||
def check_output(args, flag=True): | ||
r""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required r
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gensim/utils.py
Outdated
>>> check_output(args=['/usr/bin/python', '--version']) | ||
Python 2.6.2 | ||
Added extra KeyboardInterrupt handling | ||
def check_output(args, flag=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new interface breaks down many wrappers from gensim. First, please merge develop to your branch (it needs to correct test run).
Second, run tests locally with all wrappers like this
FT_HOME=/home/ivan/release/test/fastText WR_HOME=/home/ivan/release/test/wordrank VOWPAL_WABBIT_PATH=/home/ivan/release/test/vowpal_wabbit/vowpalwabbit/vw DTM_PATH=/home/ivan/release/test/dtm/dtm/main MALLET_HOME=/home/ivan/release/test/Mallet python setup.py test
Another variant is PR#1368 docker image. You can use it for full test run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My branch is upto date with develop. I ran python setup.py test
with the appropriate env variable. Two tests failed but these are tests unrelated to what I am working on.
What should I do now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kirit93 hm, strange, I'll run all the tests and write the results here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv, It seems like the travis build succeeded after develop was merged into issue. Is the PR good to go?
I run all tests and get 39 errors, but they are all similar to each other: First type
Second type
Third type
For this reason, I can't merge your PR. |
@menshikh-iv for the second two error types I can fix it by changing the definition of my This way the way the function is called will not have to be changed in the other files in the code. And the first type of error is not related to |
@kirit93 First type error is strange, I will investigate it later. Could you please do a few things:
After this fixes, I will run all tests again and write a result to this PR. |
@menshikh-iv I made the changes you requested and pushed them. Travis built successfully. Please let me know if the PR is okay now. |
Sorry for late response @kirit93, Now, I get a new kind of error, look at command
|
@menshikh-iv, I reverted to the old implementation of
Could the cause of the error be something unrelated to the |
@kirit93 unless your system user is |
@piskvorky, I ran the following command |
@kirit93 Where did you get that command from, do those paths really exist? |
No they don't exist on my system, but @menshikh-iv asked me to run that command to test whether the PR works or not. |
@kirit93 It's a path in my filesystem. For the full test run, you should:
|
@kirit93 I just ran all the tests on the development branch and made sure that everything works with the implementation from |
@menshikh-iv, any idea what could be the cause of this error? |
In my opinion, it is now easier to recreate PR and slightly change the code from |
@tmylk I've added a function above the existing
check_output
function in an attempt to solve this issue. Please let me know if there's anything I need to change. This is with reference to issue #703