[Feature] Triton server #2088

Open
wants to merge 16 commits into
base: main

irexyc
Collaborator

@irexyc irexyc commented May 18, 2023

Motivation

Support model serving

Modification

Add triton custom backend
Add demo

@codecov

codecov bot commented May 18, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.67%. Comparing base (8e658cd) to head (fcdf52f).
Report is 92 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2088   +/-   ##
=======================================
  Coverage   49.67%   49.67%           
=======================================
  Files         339      339           
  Lines       12998    12998           
  Branches     1906     1906           
=======================================
  Hits         6457     6457           
  Misses       6090     6090           
  Partials      451      451           
Flag Coverage Δ
unittests 49.67% <ø> (ø)


@irexyc irexyc changed the title Triton server [Feature] Triton server May 18, 2023
@RunningLeon RunningLeon requested a review from AllentDan May 23, 2023 02:16
@irexyc
Collaborator Author

irexyc commented May 23, 2023

You can temporarily use this Docker image for testing:

docker pull irexyc/mmmdeploy:triton-22.12

@Y-T-G
Contributor

Y-T-G commented Oct 11, 2023

Hey, thanks for this. I wanted to know how to correctly send multiple bboxes for keypoint-detection inference.

I created a dict for each bbox and added it to the value list, then sent that, but the results are not accurate, although the number of keypoint sets returned matches the number of bboxes.

bbox_list = [{'bbox':bbox} for bbox in bboxes.tolist()]
bbox = {
    'type': 'PoseBbox',
    'value': bbox_list
}

@Y-T-G
Contributor

Y-T-G commented Oct 14, 2023

Also, what does "only support batch dim 1 for single request" mean? Does this mean the Triton version does not support batch inference?

@irexyc
Collaborator Author

irexyc commented Oct 16, 2023

@Y-T-G

> I created a dict for each bbox here and added to the value list, and used that, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.

Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

> Also, what does this mean only support batch dim 1 for single request? Does this mean the Triton version does not support batch inference?

For batch inference with mmdeploy, you can refer to #839 (comment).

Triton server supports a dynamic batcher and a sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:

dynamic_batching {
  max_queue_delay_microseconds: 100
}

With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests for each inference step. You therefore don't have to construct a regular batched input such as b x c x h x w; you only need to send c x h x w to the Triton server and let it collect the requests into a batch.

In summary, to use the mmdeploy Triton backend with batch inference, you have to:

  1. Convert the model with batch inference support and edit the pipeline.json.
  2. Add dynamic_batching to config.pbtxt.
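
Step 2 can be scripted. The following is a minimal sketch, not part of mmdeploy or Triton: `ensure_dynamic_batching` is a hypothetical helper that treats config.pbtxt as plain text and appends the dynamic_batching block only if the string `dynamic_batching` does not already occur. It does not parse the protobuf text format, so a commented-out block would also count as present.

```python
def ensure_dynamic_batching(config_text: str, max_queue_delay_us: int = 100) -> str:
    """Return config_text with a dynamic_batching block appended if it has none.

    Hypothetical helper: plain substring check, no protobuf parsing.
    """
    if "dynamic_batching" in config_text:
        return config_text  # already mentioned; leave the text untouched
    block = (
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )
    # Make sure the existing text ends with a newline before appending.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + block
```

Calling the function a second time on its own output returns the text unchanged, so it should be safe to run repeatedly over a model repository.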

@Y-T-G
Contributor

Y-T-G commented Oct 16, 2023

@irexyc

> Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

Yes, with a single bbox the inference is correct. But if I add more than one bbox, the outputs don't make any sense.

I only visualize the nose, left wrist and right wrist keypoints. This is from RTMPose.

This is how it looks when I add more than one bbox:
image

This is how it looks when I run them individually, cropping each bbox and sending each crop for inference separately:
image

The input for multiple bboxes looks like this:

{
   "type":"PoseBbox",
   "value":[
      {
         "bbox":[
            866,
            47,
            896,
            101
         ]
      },
      {
         "bbox":[
            48,
            65,
            73,
            125
         ]
      },
      {
         "bbox":[
            425,
            32,
            447,
            97
         ]
      },
      ....
      ....
      ....
      ....
   ]
}
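
To narrow down where the multi-bbox results diverge, one option is to send the same boxes one at a time on the uncropped image and compare each result with the corresponding entry from the combined request. A minimal sketch, assuming only the payload layout shown above; `split_pose_bbox_payload` is a hypothetical helper, not an mmdeploy API:

```python
import json
from typing import Any, Dict, List


def split_pose_bbox_payload(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split one multi-bbox payload into one payload per bbox, keeping order."""
    return [
        {"type": payload["type"], "value": [entry]}
        for entry in payload["value"]
    ]


# Example with the first two boxes from the payload above.
payload = {
    "type": "PoseBbox",
    "value": [{"bbox": [866, 47, 896, 101]}, {"bbox": [48, 65, 73, 125]}],
}
singles = split_pose_bbox_payload(payload)
print(json.dumps(singles[0]))
```

If the single-bbox requests on the full image agree with the cropped runs but the combined request does not, that would point at how the list of boxes is consumed rather than at the boxes themselves.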

@Y-T-G
Contributor

Y-T-G commented Oct 16, 2023

> For batch inference of mmdeploy, you can refer to this #839 (comment)
>
> Triton server support dynamic batcher and sequence batcher. But mmdeploy backend only support dynamic batcher. You can add these lines to config.pbtxt.
>
> dynamic_batching {
>   max_queue_delay_microseconds: 100
> }
>
> With allow_ragged_batch and dynamic_batching, mmdeploy backend can receive a batch of requests for each inference step (therefore, you don't have to construct normal batch input like b x c x h x w. you only need send c x h x w to triton server and let it to collect batch requests.)
>
> In summary, to use mmdeploy triton backend with batch inference, you have to:
>
>   1. convert the model with batch inference support and edit the pipeline.json
>   2. add dynamic_batching to config.pbtxt

I am not sure this works. I don't see any improvement with this change when I check with perf_analyzer for ResNet18:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 41.6086 infer/sec, latency 47983 usec
Concurrency: 3, throughput: 41.8305 infer/sec, latency 71626 usec
Concurrency: 4, throughput: 41.775 infer/sec, latency 95672 usec
Concurrency: 5, throughput: 41.2752 infer/sec, latency 120931 usec
Concurrency: 6, throughput: 41.7747 infer/sec, latency 143440 usec
Concurrency: 7, throughput: 41.7748 infer/sec, latency 167467 usec
Concurrency: 8, throughput: 41.6641 infer/sec, latency 191807 usec

The model does support batching in its JSON config.

I see a larger improvement when launching multiple model instances with:

instance_group [ 
  { 
    count: 4
    kind: KIND_GPU 
  }
]
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 62.7163 infer/sec, latency 31838 usec
Concurrency: 3, throughput: 65.1612 infer/sec, latency 46015 usec
Concurrency: 4, throughput: 79.328 infer/sec, latency 50415 usec
Concurrency: 5, throughput: 84.3826 infer/sec, latency 59160 usec
Concurrency: 6, throughput: 90.2152 infer/sec, latency 66516 usec
Concurrency: 7, throughput: 89.4926 infer/sec, latency 78322 usec
Concurrency: 8, throughput: 88.104 infer/sec, latency 90731 usec

I think dynamic_batcher depends on sequence_batching. But since each request is handled separately in instance_state.cpp, dynamic_batching will not have any effect. To have an effect, the requests have to be batched and then inferred all at once.
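
The two perf_analyzer runs above can be compared directly. The sketch below is plain Python over the numbers copied from this comment; `parse_perf_lines` is a hypothetical helper and the regular expression assumes exactly the line format shown above. It prints the speedup of the instance_group run over the dynamic_batching run for each concurrency level, and the implied time per request (1 / throughput) for the dynamic_batching run.

```python
import re

DYNAMIC_BATCHING = """\
Concurrency: 2, throughput: 41.6086 infer/sec, latency 47983 usec
Concurrency: 3, throughput: 41.8305 infer/sec, latency 71626 usec
Concurrency: 4, throughput: 41.775 infer/sec, latency 95672 usec
Concurrency: 5, throughput: 41.2752 infer/sec, latency 120931 usec
Concurrency: 6, throughput: 41.7747 infer/sec, latency 143440 usec
Concurrency: 7, throughput: 41.7748 infer/sec, latency 167467 usec
Concurrency: 8, throughput: 41.6641 infer/sec, latency 191807 usec
"""

INSTANCE_GROUP = """\
Concurrency: 2, throughput: 62.7163 infer/sec, latency 31838 usec
Concurrency: 3, throughput: 65.1612 infer/sec, latency 46015 usec
Concurrency: 4, throughput: 79.328 infer/sec, latency 50415 usec
Concurrency: 5, throughput: 84.3826 infer/sec, latency 59160 usec
Concurrency: 6, throughput: 90.2152 infer/sec, latency 66516 usec
Concurrency: 7, throughput: 89.4926 infer/sec, latency 78322 usec
Concurrency: 8, throughput: 88.104 infer/sec, latency 90731 usec
"""

LINE_RE = re.compile(
    r"Concurrency: (\d+), throughput: ([\d.]+) infer/sec, latency (\d+) usec"
)


def parse_perf_lines(text):
    """Map concurrency -> (throughput in infer/sec, latency in usec)."""
    return {
        int(conc): (float(thr), int(lat))
        for conc, thr, lat in LINE_RE.findall(text)
    }


dyn = parse_perf_lines(DYNAMIC_BATCHING)
inst = parse_perf_lines(INSTANCE_GROUP)

# Throughput ratio of the instance_group run over the dynamic_batching run.
speedup = {c: inst[c][0] / dyn[c][0] for c in sorted(dyn)}
for c, s in speedup.items():
    ms_per_request = 1000.0 / dyn[c][0]
    print(f"concurrency {c}: speedup {s:.2f}x, "
          f"dynamic_batching run ~{ms_per_request:.1f} ms per request")
```

In the dynamic_batching run the throughput stays near 41.6 infer/sec at every concurrency level, which works out to roughly 24 ms per request. That flat profile is consistent with requests being executed one at a time, although the numbers alone do not prove it.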
