update ping endpoint default behavior #2254
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##           master    #2254    +/-   ##
==========================================
+ Coverage   70.28%   70.40%   +0.11%
==========================================
  Files          75       75
  Lines        3392     3392
  Branches       57       57
==========================================
+ Hits         2384     2388      +4
+ Misses       1005     1001      -4
  Partials        3        3
```

2 files saw indirect coverage changes.
```java
    }
} else if (state == WorkerState.WORKER_STOPPED) {
    if (recoveryStartTS == 0) {
        recoveryStartTS = currentTS;
```
Is `recoveryStartTS = currentTS;` only valid in the case of WorkerState.WORKER_STOPPED? What about WorkerState.WORKER_SCALED_DOWN?
Also, does the current logic handle the case where the thread is dying because of an OOM?
WORKER_SCALED_DOWN is used for model unregistration or a scale-down request; it does not trigger a backend worker retry.
Any exception, such as an OOM, changes the worker state to WorkerState.WORKER_STOPPED and then triggers a retry. That's why `recoveryStartTS = currentTS` happens only on WorkerState.WORKER_STOPPED.
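For readers following this thread, here is a minimal, self-contained Java sketch of the recovery-window behavior being discussed. Only `recoveryStartTS`, `currentTS`, and the WorkerState names come from the diff above; the class, constructor, and `isWithinRecoveryWindow` helper are hypothetical illustrations, not TorchServe's actual implementation.

```java
// Illustrative sketch only; names other than recoveryStartTS, currentTS,
// and the WorkerState values are hypothetical.
enum WorkerState { WORKER_STARTED, WORKER_MODEL_LOADED, WORKER_STOPPED, WORKER_SCALED_DOWN }

public class WorkerRecoverySketch {

    private long recoveryStartTS = 0;          // 0 means "no recovery window open"
    private final long maxRetryTimeoutMillis;  // per the doc text, defaults to 5 minutes

    public WorkerRecoverySketch(long maxRetryTimeoutInSec) {
        this.maxRetryTimeoutMillis = maxRetryTimeoutInSec * 1000L;
    }

    // Called on every worker state transition.
    public void onStateChange(WorkerState state, long currentTS) {
        if (state == WorkerState.WORKER_MODEL_LOADED) {
            // Worker recovered: close the window so a future failure starts fresh.
            recoveryStartTS = 0;
        } else if (state == WorkerState.WORKER_STOPPED) {
            if (recoveryStartTS == 0) {
                recoveryStartTS = currentTS;   // first failure opens the recovery window
            }
        }
        // WORKER_SCALED_DOWN is deliberately ignored: unregistration and
        // scale-down are not failures, so no retry window is opened.
    }

    // A stopped worker still counts as healthy while inside the window.
    public boolean isWithinRecoveryWindow(long currentTS) {
        return recoveryStartTS == 0
                || (currentTS - recoveryStartTS) <= maxRetryTimeoutMillis;
    }
}
```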
Besides the one comment, LGTM
````
@@ -41,6 +41,11 @@ If the server is running, the response is:
}
```

"maxRetryTimeoutInSec" (default: 5 min) can be defined in a model's config YAML file (e.g. model-config.yaml). It is the maximum time window for recovering a dead backend worker. Within this window, a healthy worker can be in the state WORKER_STARTED, WORKER_MODEL_LOADED, or WORKER_STOPPED. The "Ping" endpoint …
````
Nit: would it be better to name this config option maxRecoveryTimeoutInSec?
Description
Please read our CONTRIBUTING.md prior to creating your first pull request.
Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
The ping endpoint's default behavior is changed as described in #2231.
Fixes #2231
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: