Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add job queue status to describe API #2464

Merged
merged 5 commits into from
Aug 2, 2023

Conversation

namannandan
Copy link
Collaborator

@namannandan namannandan commented Jul 13, 2023

Description

Include job queue status in the describe API response:

  • Remaining capacity
  • Pending requests
  • Consecutive failed requests
curl http://localhost:8081/models/noop
[
    {
      "modelName": "noop",
      "modelVersion": "1.0",
      "modelUrl": "noop.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
      }
    }
]

Fixes #2101, #2412

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

  • Unit test added in this PR
  • Sanity and regression tests in CI

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@codecov
Copy link

codecov bot commented Jul 13, 2023

Codecov Report

Merging #2464 (2320505) into master (fdcc223) will not change coverage.
The diff coverage is n/a.

❗ Current head 2320505 differs from pull request most recent head c40005e. Consider uploading reports for the commit c40005e to get more accurate results

@@           Coverage Diff           @@
##           master    #2464   +/-   ##
=======================================
  Coverage   72.66%   72.66%           
=======================================
  Files          78       78           
  Lines        3669     3669           
  Branches       58       58           
=======================================
  Hits         2666     2666           
  Misses        999      999           
  Partials        4        4           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@namannandan namannandan marked this pull request as ready for review July 13, 2023 18:15
Copy link
Collaborator

@lxning lxning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't break existing description attributes.

@namannandan
Copy link
Collaborator Author

namannandan commented Jul 13, 2023

Please don't break existing description attributes.

@lxning currently we don't publish any data/metrics related to job queue in the describe API endpoint. Now, I've included an additional field called jobQueueStatus. Also, I changed the implementation name from metrics to jobQueueStatus since this may be confusing given we already have a separate metrics endpoint. Do you suggest creating a separate endpoint to fetch jobQueueStatus?

Existing implementation:

curl http://localhost:8081/models/noop
[
    {
      "modelName": "noop",
      "modelVersion": "1.0",
      "modelUrl": "noop.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ]
    }
]

With the changes in this PR:

curl http://localhost:8081/models/noop
[
    {
      "modelName": "noop",
      "modelVersion": "1.0",
      "modelUrl": "noop.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0,
        "consecutiveFailedRequests": 0
      }
    }
]

@namannandan namannandan requested a review from lxning July 13, 2023 19:05
Comment on lines 145 to 150
public JobQueueStatus getJobQueueStatus() {
return jobQueueStatus;
}

public void setMetrics(Metrics metrics) {
this.metrics = metrics;
public void setJobQueueStatus(JobQueueStatus jobQueueStatus) {
this.jobQueueStatus = jobQueueStatus;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TS provide SDK to allow cx to customize the endpoint response. It will break cx's customized plugin when class Metrics and func getMetrics/etMetrics are moved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated implementation to retain the Metrics class definition as is.

@agunapal agunapal merged commit 53ff6a5 into pytorch:master Aug 2, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

monitor model queue status
3 participants