Log entire stdout and stderr for terminated backend worker process #3036
Conversation
It seems that the following code does not address the root cause listed in the description section
I believe it does address the root cause. Added a regression test to confirm the behavior: https://github.com/pytorch/serve/pull/3036/files#diff-b4f49147fd772129dbf21c64a502a950c5cb76a192a76de8df4e4d2949c6816f
Without the fix in this PR (notice that only part of the traceback is logged, causing the test to fail):
With the fix in this PR:
see comments, overall LGTM
@@ -380,10 +380,14 @@ public void terminate() {
         @Override
         public void run() {
             try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
-                while (isRunning.get() && scanner.hasNext()) {
+                while (scanner.hasNextLine()) {
I assume that hasNextLine() will return true for the input stream as long as the process is alive, even though the stream does not contain new input?
I believe hasNextLine() will block and only return true once a new line of input is available. It will return false once EOF is encountered or the underlying stream has been closed.
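For context, a minimal standalone sketch of that behavior; the ScannerEofDemo class name and the use of the echo command (assumed to be on the PATH of a Unix-like system) are illustrative assumptions, not part of this PR:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerEofDemo {
    public static void main(String[] args) throws Exception {
        // Spawn a short-lived child process whose stdout we will drain.
        Process process = new ProcessBuilder("echo", "first line").start();
        InputStream is = process.getInputStream();

        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            // hasNextLine() blocks until a complete line is buffered and only
            // returns false once EOF is reached, i.e. after the child process
            // has exited and its end of the pipe is closed.
            while (scanner.hasNextLine()) {
                System.out.println("read: " + scanner.nextLine());
            }
        }

        // Reaching this point means the stream was fully drained.
        System.out.println("EOF reached, child exit code: " + process.waitFor());
    }
}
```

With a long-running child that prints intermittently, the same loop simply blocks between lines rather than returning false, which matches the behavior described above.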
        break

for line in traceback:
    assert any(line in log for log in logs)
I think this should be:
assert all(line in logs for line in traceback)
any will stop after the first line is found
line in logs would not work because the logs include other details such as timestamp and log level along with the traceback line, which cannot be hardcoded in the test, so I believe the logic has to be line in log for log in logs.
Agree that any will stop after the first line is found, which is why I loop over each of the traceback lines I expect to find in the captured logs:
for line in traceback:
    assert any(line in log for log in logs)
Test with the changes on this PR to validate that the entire traceback is logged:
As can be seen from the above logs, the reader thread exits from the code referenced below:
serve/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerLifeCycle.java Lines 382 to 383 in 1ec8932
serve/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerLifeCycle.java Lines 449 to 457 in 1ec8932
LGTM
Description
The reader threads that consume stdout and stderr from the backend worker process are currently terminated on worker failure before the existing Scanner buffer is drained:
serve/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerLifeCycle.java Lines 383 to 384 in 1994aa0
This causes loss of data and prevents the full output (the error traceback in the case of a backend worker crash) from being logged.
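As a point of reference, a minimal sketch of the drain-until-EOF reader pattern this PR moves to; the StreamDrainer class, its prefix field, and the use of System.out for logging are illustrative assumptions rather than the actual WorkerLifeCycle implementation:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Illustrative sketch only: reads a worker process's output stream until EOF
// instead of stopping as soon as an external "running" flag is cleared.
public class StreamDrainer implements Runnable {
    private final InputStream is;
    private final String prefix;

    public StreamDrainer(InputStream is, String prefix) {
        this.is = is;
        this.prefix = prefix;
    }

    @Override
    public void run() {
        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            // Looping on hasNextLine() keeps the thread alive until the worker
            // process closes its stdout/stderr, so lines already buffered when
            // the worker dies (e.g. the tail of a traceback) are still logged.
            while (scanner.hasNextLine()) {
                System.out.println(prefix + scanner.nextLine());
            }
        }
    }
}
```

A thread running such a Runnable is left to finish on its own once the worker's streams hit EOF, rather than being stopped by an external flag while output may still be buffered.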
Consider a handler with a deliberate bug as follows:
On loading the model with the current implementation on the master branch:
(Notice that the entire backend worker error traceback is logged only for the first attempt to start the worker. For subsequent attempts, only a portion of the traceback is logged.)
On loading the model with the fix in this PR:
(Notice that the entire backend worker error traceback is logged for every worker restart)
Fixes #3034
Type of change
Feature/Issue validation/testing