Merge branch 'master' into issue_205
dhanainme authored Apr 25, 2020
2 parents fc3a11d + 719add3 commit 6642df0
Showing 12 changed files with 382 additions and 301 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -104,8 +104,8 @@ This section shows a simple example of serving a model with TorchServe. To compl
To run this example, clone the TorchServe repository and navigate to the root of the repository:

```bash
cd ~
git clone https://github.com/pytorch/serve.git
cd serve
```

Then run the following steps from the root of the repository.
@@ -134,14 +134,14 @@ You can also create model stores to store your archived models.

```bash
torch-model-archiver --model-name densenet161 --version 1.0 --model-file ~/serve/examples/image_classifier/densenet_161/model.py --serialized-file ~/model_store/densenet161-8d451a50.pth --extra-files ~/serve/examples/image_classifier/index_to_name.json --handler image_classifier
```

For more information about the model archiver, see [Torch Model archiver for TorchServe](../model-archiver/README.md)
For more information about the model archiver, see [Torch Model archiver for TorchServe](model-archiver/README.md)

### Start TorchServe to serve the model

After you archive and store the model, use the `torchserve` command to serve the model.

```bash
torchserve --start --model-store ~/model_store --models ~/model_store/densenet161=densenet161.mar
torchserve --start --model-store ~/model_store --models ~/model_store/densenet161.mar
```

After you execute the `torchserve` command above, TorchServe runs on your host, listening for inference requests.
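
Once the server is listening, you can send a quick smoke test from Python. This is a minimal sketch, assuming the `requests` package is installed, the default inference port 8080, and a local image saved as `kitten.jpg`:

```python
# Minimal smoke test for a locally running TorchServe instance.
# Assumptions: `pip install requests`, TorchServe on the default inference
# port 8080, and a sample image saved as kitten.jpg.
import requests

# Health check against the ping endpoint.
print(requests.get("http://127.0.0.1:8080/ping").json())

# Send one image to the densenet161 model registered above.
with open("kitten.jpg", "rb") as f:
    resp = requests.post("http://127.0.0.1:8080/predictions/densenet161", data=f)
print(resp.json())
```
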
111 changes: 70 additions & 41 deletions docs/batch_inference_with_ts.md
@@ -1,6 +1,7 @@
# Batch Inference with TorchServe

## Contents of this Document

* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Batch Inference with TorchServe's default handlers](#batch-inference-with-torchserves-default-handlers)
@@ -9,50 +10,66 @@

## Introduction

Batching in the Machine-Learning/Deep-Learning is a process of aggregating inference-requests and sending this aggregated requests through the ML/DL framework for inference at once.
TorchServe was designed to natively support batching of incoming inference requests. This functionality provides customer using TorchServe to optimally utilize their host resources, because most ML/DL frameworks
are optimized for batch requests. This optimal utilization of host resources in turn reduces the operational expense of hosting an inference service using TorchServe. In this document we will go through an example of how this is done
and compare the performance of running a batched inference against running single inference.
Batch inference is the process of aggregating inference requests and sending the aggregated requests through the ML/DL framework for inference all at once.
TorchServe was designed to natively support batching of incoming inference requests. This functionality enables you to use your host resources optimally,
because most ML/DL frameworks are optimized for batch requests.
This optimal use of host resources in turn reduces the operational expense of hosting an inference service using TorchServe.
In this document, we show an example of how this is done and compare the performance of batched inference with single-request inference.

## Prerequisites

Before jumping into this document, read the following docs:

## Prerequisites:
Before jumping into this document, please go over the following docs
1. [What is TorchServe?](../README.md)
1. [What is custom service code?](custom_service.md)

## Batch Inference with TorchServe's default handlers

TorchServe's default handlers do not support batch inference.

## Batch Inference with TorchServe using ResNet-152 model
To support batching of inference requests, TorchServe needs the following:
1. TorchServe Model Configuration: TorchServe provides means to configure "Max Batch Size" and "Max Batch Delay" through "POST /models" API.
TorchServe needs to know the maximum batch size that the model can handle and the maximum delay that TorchServe should wait for, to form this request-batch.
2. Model Handler code: TorchServe requires the Model Handler to handle the batch of inference requests.

For a full working code of a custom model handler with batch processing, refer to [resnet152_handler.py](../examples/image_classifier/resnet_152_batch/resnet152_handler.py)
To support batch inference, TorchServe needs the following:

1. TorchServe model configuration: Configure `batch_size` and `max_batch_delay` by using the "POST /models" management API.
TorchServe needs to know the maximum batch size that the model can handle and the maximum time that TorchServe should wait to fill each batch request.
2. Model handler code: TorchServe requires the model handler to handle batch inference requests.

For a full working example of a custom model handler with batch processing, see [resnet152_handler.py](../examples/image_classifier/resnet_152_batch/resnet152_handler.py).
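
To make the handler contract concrete, the sketch below shows the general shape of a batch-aware handler. It is illustrative only, not the linked resnet152_handler.py: the `model.pt` file name, the TorchScript loading, and the torchvision preprocessing are assumptions made for this example.

```python
# Illustrative sketch only -- not the real resnet152_handler.py.
# TorchServe hands the handler a *list* of requests (up to batch_size long);
# the handler must return a list with exactly one response per request.
import io

import torch
from PIL import Image
from torchvision import transforms


class BatchImageHandler:
    """Minimal batch-aware custom handler (assumes a TorchScript model.pt)."""

    def __init__(self):
        self.model = None
        self.device = None
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ])

    def initialize(self, context):
        # Called once per worker: load the serialized model shipped in the .mar.
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = torch.jit.load(f"{model_dir}/model.pt", map_location=self.device)
        self.model.eval()

    def preprocess(self, requests):
        # Each element is a dict whose payload sits under "data" or "body".
        images = []
        for req in requests:
            payload = req.get("data") or req.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.transform(image))
        return torch.stack(images).to(self.device)

    def inference(self, batch):
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        # One entry per request, in the same order the requests arrived.
        return outputs.argmax(dim=1).tolist()

    def handle(self, data, context):
        return self.postprocess(self.inference(self.preprocess(data)))


_handler = BatchImageHandler()


def handle(data, context):
    # Module-level entry point that TorchServe invokes for every batch.
    if _handler.model is None:
        _handler.initialize(context)
    if data is None:
        return None
    return _handler.handle(data, context)
```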

### TorchServe Model Configuration
To configure TorchServe to use the batching feature, you would have to provide the batch configuration information through [**POST /models** API](management_api.md#register-a-model).
The configuration that we are interested in is the following:
1. `batch_size`: This is the maximum batch size that a model is expected to handle.
2. `max_batch_delay`: This is the maximum batch delay time TorchServe waits to receive `batch_size` number of requests. If TorchServe doesn't receive `batch_size` number of requests
before this timer time's out, it sends what ever requests that were received to the model `handler`.

To configure TorchServe to use the batching feature, provide the batch configuration information through [**POST /models** API](management_api.md#register-a-model).

The configuration that we are interested in is the following:

1. `batch_size`: This is the maximum batch size that a model is expected to handle.
2. `max_batch_delay`: This is the maximum batch delay time TorchServe waits to receive `batch_size` number of requests. If TorchServe doesn't receive `batch_size` number of
requests before this timer times out, it sends whatever requests were received to the model `handler`.

Let's look at an example using this configuration:

```bash
# The following command registers a model "resnet-152.mar" and configures TorchServe to use a batch_size of 8 and a max_batch_delay of 50 milliseconds.
curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
```

These configurations are used both in TorchServe and in the model's custom-service-code (a.k.a the handler code). TorchServe associates the batch related configuration with each model. The frontend then tries to aggregate the batch-size number of requests and send it to the backend.

These configurations are used both in TorchServe and in the model's custom service code (a.k.a. the handler code).
TorchServe associates the batch-related configuration with each model.
The frontend then tries to aggregate `batch_size` requests and send them to the backend.
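
Conceptually, the frontend's aggregation step behaves like the following sketch. This is a simplification for intuition only, not TorchServe's actual implementation (the real frontend is written in Java):

```python
import queue
import time


def collect_batch(requests: "queue.Queue", batch_size: int, max_batch_delay_ms: int):
    """Gather up to batch_size requests, waiting at most max_batch_delay_ms
    after the first request arrives, then hand the (possibly partial) batch
    to a backend worker."""
    batch = [requests.get()]                        # block until the first request
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                   # delay expired: send a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```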

## Demo to configure TorchServe with batch-supported model

### Pre-requisites
Follow the main [Readme](../README.md) and install all the required packages including "torchserve"
In this section, let's bring up the model server and launch the Resnet-152 model, which has been built to handle a batch of requests.

### Prerequisites

Follow the main [Readme](../README.md) and install all the required packages including `torchserve`.

### Loading Resnet-152 which handles batch inferences

* Start the model server. In this example, we are starting the model server to run on inference port 8080 and management port 8081.

```text
$ cat config.properties
...
@@ -62,26 +79,29 @@ management_address=http://0.0.0.0:8081
$ torchserve --start --model-store model_store
```

Note : This example assumes that the resnet-152.mar file is available in the torchserve model_store. For more details on creating resnet-152 mar file and serving it on TorchServe refer [resnet152 image classification example](../examples/image_classifier/resnet_152_batch/README.md)
**Note**: This example assumes that the resnet-152.mar file is available in the `model_store`.
For more details on creating the resnet-152 mar file and serving it with TorchServe, see the [resnet152 image classification example](../examples/image_classifier/resnet_152_batch/README.md).

* Verify that TorchServe is up and running

* Verify that the TorchServe is up and running
```text
$ curl localhost:8080/ping
{
"status": "Healthy"
}
```

* Now lets launch resnet-152 model, which we have built to handle batch inference. Since this is an example, we are going to launch 1 worker which handles a batch size of 8
with a max-batch-delay of 10ms.
* Now let's launch the resnet-152 model, which we have built to handle batch inference. Because this is an example, we are going to launch 1 worker that handles a batch size of 8 with a `max_batch_delay` of 10 ms.

```text
$ curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=10&initial_workers=1"
{
"status": "Processing worker updates..."
}
```

* Verify that the workers were started properly
* Verify that the workers were started properly.

```text
$ curl localhost:8081/models/resnet-152
{
@@ -104,12 +124,16 @@ $ curl localhost:8081/models/resnet-152
}
```

* Now let's test this service.

* Get an image to test this service.

```text
$ curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg
```

* Run inference to test the model.

```text
$ curl -X POST localhost/predictions/resnet-152 -T kitten.jpg
{
@@ -133,25 +157,30 @@ $ curl localhost:8081/models/resnet-152
"class": "n02129604 tiger, Panthera tigris"
}
```
* Now that we have the service up and running, we could run performance tests with the same kitten image as follows. There are multiple tools to measure performance of web-servers. We will use

* Now that we have the service up and running, we can run performance tests with the same kitten image as follows. There are multiple tools to measure the performance of web servers. We will use
[apache-bench](https://httpd.apache.org/docs/2.4/programs/ab.html) to run our performance tests because it is easy to install and easy to run.
Before running this test, we need to first install `apache-bench` on our System. Since we were running this on a ubuntu host, we installed apache-bench as follows

Before running this test, we need to install `apache-bench` on our system. Because we ran this example on an Ubuntu host, we installed `apache-bench` as follows:

```bash
$ sudo apt-get update && sudo apt-get install apache2-utils
```

Now that installation is done, we can run the performance benchmark test as follows:

```text
$ ab -k -l -n 10000 -c 1000 -T "image/jpeg" -p kitten.jpg localhost:8080/predictions/resnet-152
```

The above test simulates TorchServe receiving 1,000 concurrent requests at a time and a total of 10,000 requests. All of these requests are directed to the endpoint "localhost:8080/predictions/resnet-152", which assumes
that resnet-152 is already registered and scaled up on TorchServe. We completed this registration and scale-up in the steps above.
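
If `apache-bench` is not available, a rough equivalent can be scripted. The sketch below is a stand-in under assumptions (the `requests` package is installed and the same `kitten.jpg` is present); it is useful as a quick sanity check rather than a replacement for a proper load-testing tool:

```python
# Quick-and-dirty concurrent load test (assumes `pip install requests`).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/predictions/resnet-152"
with open("kitten.jpg", "rb") as f:
    PAYLOAD = f.read()


def one_request(_):
    start = time.monotonic()
    requests.post(URL, data=PAYLOAD)
    return time.monotonic() - start


with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = list(pool.map(one_request, range(500)))

print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```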

## Conclusion
The take away from the experiments is that batching is a very useful feature. In cases where the services receive heavy load of requests or each request has high I/O, its advantageous
to batch the requests. This allows for maximally utilizing the compute resources, especially GPU compute which are also more often than not more expensive. But customers should
do their due diligence and perform enough tests to find optimal batch size depending on the number of GPUs available and number of models loaded per GPU. Customers should also
analyze their traffic patterns before enabling the batch-inference. As shown in the above experiments, services receiving TPS lesser than the batch size would lead to consistent
"batch delay" timeouts and cause the response latency per request to spike. As any cutting edge technology, batch-inference is definitely a double edged sword.


The takeaway from this example is that batching is a very useful feature. In cases where a service receives a heavy load of requests or each request has high I/O,
it's advantageous to batch the requests. This allows you to maximally utilize compute resources, especially GPU resources, which are usually more expensive. But you should do your due diligence and perform enough tests to find the optimal batch size, depending on the number of GPUs available
and the number of models loaded per GPU.
You should also analyze your traffic patterns before enabling batch inference. As shown in the above experiments,
a service receiving fewer requests per second than the batch size will hit consistent "batch delay" timeouts, causing the per-request response latency to spike.
As with any cutting-edge technology, batch inference is definitely a double-edged sword.
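
As a back-of-the-envelope illustration of that last point, consider a hypothetical service; the numbers below are made up for the example:

```python
# Back-of-the-envelope: added latency from batching at low traffic.
batch_size = 8
max_batch_delay_ms = 50
tps = 20                                   # incoming requests per second (example value)

# Requests arriving per delay window:
arrivals_per_window = tps * max_batch_delay_ms / 1000
print(arrivals_per_window)                 # 1.0 -> a batch of 8 almost never fills
# With roughly one request per 50 ms window, nearly every request waits the
# full max_batch_delay before being dispatched, adding up to ~50 ms of latency.
```
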
15 changes: 7 additions & 8 deletions docs/code_coverage.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
## To execute unit testing and generate code coverage report follow these steps:
# Execute unit testing and generate a code coverage report

## Pre-requisites
## Prerequisites

You will need some additional Python modules to run the unit tests and linting.
You need some additional Python modules to run the unit tests and linting.

```bash
pip install mock pytest pylint pytest-mock pytest-cov
```

@@ -23,9 +23,8 @@ cd serve

* torch-model-archiver pytest suite

The reports can be accessed at the following path :

- TorchServe frontende : serve/frontend/server/build/reports
- TorchServe backend : serve/htmlcov
- torch-model-archiver : serve/model-archiver/htmlcov
The reports can be accessed at the following paths:

* TorchServe frontend: `serve/frontend/server/build/reports`
* TorchServe backend: `serve/htmlcov`
* torch-model-archiver: `serve/model-archiver/htmlcov`
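
For the backend suite specifically, coverage can also be produced programmatically. This is a sketch under assumptions: it presumes the backend package is `ts` and that its unit tests live under `ts/tests/unit_tests`, with the HTML report written to the `htmlcov` path listed above.

```python
# Hypothetical helper that runs the backend unit tests with coverage from
# Python (equivalent to invoking pytest on the command line).
# Assumed paths: backend package `ts`, tests under ts/tests/unit_tests.
import pytest

exit_code = pytest.main([
    "--cov=ts",                       # measure coverage of the backend package
    "--cov-report=html:htmlcov",      # write the HTML report to serve/htmlcov
    "ts/tests/unit_tests",            # assumed location of the backend unit tests
])
raise SystemExit(exit_code)
```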