Change docker image to production #2227

agunapal · 2023-04-12T18:29:13Z

Description

We have been pushing the "dev" Docker image for the last few releases.

This PR switches the published image to the production build to reduce the image size:
For CPU images, the size drops from 3.84 GB to 2.17 GB.
For GPU images, the size drops from 11.9 GB to 8.57 GB.
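
To put the size numbers above in relative terms, here is a minimal Python sketch that computes the percentage reduction for each image. The sizes are the ones quoted in this description; the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before_gb: float, after_gb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (before_gb - after_gb) / before_gb * 100


# Sizes in GB as reported in this PR description.
sizes = {"cpu": (3.84, 2.17), "gpu": (11.9, 8.57)}

for name, (before, after) in sizes.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% smaller")
```

With these inputs the reduction works out to about 43.5% for the CPU image and about 28.0% for the GPU image.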

Fixes #(issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -t pytorch/torchserve:latest
[+] Building 166.5s (22/22) FINISHED                                                                                                                                            
 => [internal] load build definition from Dockerfile                                                                                                                       0.0s
 => => transferring dockerfile: 5.61kB                                                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                                                          0.0s
 => => transferring context: 2B                                                                                                                                            0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                      0.6s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                 0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                                                                            0.0s
 => [runtime-image 1/9] FROM docker.io/library/ubuntu:20.04                                                                                                                0.0s
 => [internal] load build context                                                                                                                                          0.0s
 => => transferring context: 80B                                                                                                                                           0.0s
 => [runtime-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y   143.1s
 => [compile-image 2/7] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties  107.4s
 => [compile-image 3/7] RUN python3.9 -m venv /home/venv                                                                                                                   3.4s
 => [compile-image 4/7] RUN python -m pip install -U pip setuptools                                                                                                        3.6s
 => [compile-image 5/7] RUN export USE_CUDA=1                                                                                                                              0.6s
 => [compile-image 6/7] RUN TORCH_VER=$(curl --silent --location https://pypi.org/pypi/torch/json | python -c "import sys, json, pkg_resources; releases = json.load(sys  29.9s
 => [runtime-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                 0.6s
 => [compile-image 7/7] RUN python -m pip install --no-cache-dir captum torchtext torchserve torch-model-archiver pyyaml                                                   5.9s
 => [runtime-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                               4.9s 
 => [runtime-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                    0.0s 
 => [runtime-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                  0.3s 
 => [runtime-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                        0.0s 
 => [runtime-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                   0.4s 
 => [runtime-image 9/9] WORKDIR /home/model-server                                                                                                                         0.0s 
 => exporting to image                                                                                                                                                     5.7s 
 => => exporting layers                                                                                                                                                    5.7s 
 => => writing image sha256:fd43886d6e11d3cc23b402c190909f732a147fb3c251bd93169db0a7ae9f733f                                                                               0.0s 
 => => naming to docker.io/pytorch/torchserve:latest                                                                                                                       0.0s

pytorch/torchserve-dev   latest                     ecd189c65df1   27 seconds ago   3.84GB
pytorch/torchserve       latest                     fd43886d6e11   10 minutes ago   2.17GB
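
A comparison like the listing above could also be checked in a script. The sketch below parses the two rows shown (copied verbatim) and reports the difference. `parse_size_gb` is a hypothetical helper and only handles the `GB` suffix seen here.

```python
import re
from typing import Tuple

# The two rows shown above, copied verbatim from the `docker images` output.
listing = """\
pytorch/torchserve-dev   latest                     ecd189c65df1   27 seconds ago   3.84GB
pytorch/torchserve       latest                     fd43886d6e11   10 minutes ago   2.17GB
"""


def parse_size_gb(row: str) -> Tuple[str, float]:
    """Return (repository, size in GB) for one row; only 'GB' sizes are handled."""
    fields = row.split()
    match = re.fullmatch(r"([\d.]+)GB", fields[-1])
    if match is None:
        raise ValueError(f"unsupported size field: {fields[-1]!r}")
    return fields[0], float(match.group(1))


# Map each repository name to its size, skipping blank lines.
sizes = dict(parse_size_gb(row) for row in listing.splitlines() if row.strip())
saved = sizes["pytorch/torchserve-dev"] - sizes["pytorch/torchserve"]
print(f"production image is {saved:.2f} GB smaller")
```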

Test Cases

CPU

BERT

~/serve/examples/Huggingface_Transformers$ python Download_Transformer_models.py
Transformers version 4.11.0
Download model and tokenizer bert-base-uncased
...
Successfully created directory ./Transformer_model 
Save model and tokenizer/ Torchscript model based on the setting from setup_config bert-base-uncased in directory ./Transformer_model

~/serve/examples/Huggingface_Transformers$ docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T20:01:13,538 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T20:01:13,642 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1

Install transformers in docker container
docker exec -it 3025 /bin/bash
model-server@3025a0fd0bfa:~$ pip install transformers
Collecting transformers

curl -X POST "localhost:8081/models?model_name=my_tc&url=BERTSeqClassification.mar&initial_workers=1"
{
  "status": "Model \"my_tc\" Version: 1.0 registered with 1 initial workers"
}

curl -X POST http://127.0.0.1:8080/predictions/my_tc -T Seq_classification_artifacts/sample_text_captum_input.txt
Not Accepted

ResNet18

wget https://download.pytorch.org/models/resnet18-f37072fd.pth
torch-model-archiver --model-name resnet-18 --version 1.0 --model-file ./examples/image_classifier/resnet_18/model.py --serialized-file resnet18-f37072fd.pth --handler image_classifier --extra-files ./examples/image_classifier/index_to_name.json
mkdir model_store
mv resnet-18.mar model_store/

docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T20:18:29,197 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T20:18:29,317 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1

curl -X POST "localhost:8081/models?model_name=resnet-18&url=resnet-18.mar&initial_workers=1"
{
  "status": "Model \"resnet-18\" Version: 1.0 registered with 1 initial workers"
}
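
The registration call above passes its parameters in the query string of the management API (port 8081 in these logs). As a minimal sketch, the same URL can be assembled in Python. `register_url` is a hypothetical helper, not part of TorchServe, and the `http://` scheme is an assumption, since the curl command omits it.

```python
from urllib.parse import urlencode


def register_url(model_name: str, mar_file: str, initial_workers: int = 1,
                 host: str = "localhost", port: int = 8081) -> str:
    """Build the model registration URL used in the curl call above."""
    query = urlencode({
        "model_name": model_name,
        "url": mar_file,
        "initial_workers": initial_workers,
    })
    return f"http://{host}:{port}/models?{query}"


print(register_url("resnet-18", "resnet-18.mar"))
```

The resulting string can be sent as a POST request with any HTTP client once the container is running.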

curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
{
  "tabby": 0.40966305136680603,
  "tiger_cat": 0.34670504927635193,
  "Egyptian_cat": 0.1300286501646042,
  "lynx": 0.023919589817523956,
  "bucket": 0.011532178148627281
}
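
A manual check like the one above could be turned into an assertion. The sketch below parses the response body shown (copied verbatim) and picks the highest-scoring label. `top_prediction` is a hypothetical helper, and treating "tabby" as the expected answer for kitten.jpg is an assumption based on this single run.

```python
import json
from typing import Tuple

# Response body copied from the curl call above.
response_body = """
{
  "tabby": 0.40966305136680603,
  "tiger_cat": 0.34670504927635193,
  "Egyptian_cat": 0.1300286501646042,
  "lynx": 0.023919589817523956,
  "bucket": 0.011532178148627281
}
"""


def top_prediction(body: str) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    scores = json.loads(body)
    label = max(scores, key=scores.get)
    return label, scores[label]


label, score = top_prediction(response_body)
# Assumed expectation for kitten.jpg, based only on the run logged above.
assert label == "tabby"
print(label, score)
```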

GPU

BERT

~/serve/examples/Huggingface_Transformers$ python Download_Transformer_models.py
Transformers version 4.11.0
Download model and tokenizer bert-base-uncased
...
Successfully created directory ./Transformer_model 
Save model and tokenizer/ Torchscript model based on the setting from setup_config bert-base-uncased in directory ./Transformer_model

docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T21:50:55,529 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T21:50:55,627 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 8

Install transformers in docker container
docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED          STATUS          PORTS                                                                          NAMES
7306991cd501   pytorch/torchserve:latest-gpu   "/usr/local/bin/dock…"   29 seconds ago   Up 28 seconds   7070-7071/tcp, 0.0.0.0:8080-8082->8080-8082/tcp, :::8080-8082->8080-8082/tcp   silly_jackson

docker exec -it 7306 /bin/bash
model-server@7306991cd501:~$ pip install transformers
Collecting transformers

curl -X POST "localhost:8081/models?model_name=my_tc&url=BERTSeqClassification.mar&initial_workers=1"
{
  "status": "Model \"my_tc\" Version: 1.0 registered with 1 initial workers"
}

curl -X POST http://127.0.0.1:8080/predictions/my_tc -T Seq_classification_artifacts/sample_text_captum_input.txt
Not Accepted

ResNet18

wget https://download.pytorch.org/models/resnet18-f37072fd.pth
torch-model-archiver --model-name resnet-18 --version 1.0 --model-file ./examples/image_classifier/resnet_18/model.py --serialized-file resnet18-f37072fd.pth --handler image_classifier --extra-files ./examples/image_classifier/index_to_name.json
mkdir model_store
mv resnet-18.mar model_store/

docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T21:46:34,986 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T21:46:35,151 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 8

curl -X POST "localhost:8081/models?model_name=resnet-18&url=resnet-18.mar&initial_workers=1"
{
  "status": "Model \"resnet-18\" Version: 1.0 registered with 1 initial workers"
}

curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
{
  "tabby": 0.40966305136680603,
  "tiger_cat": 0.34670504927635193,
  "Egyptian_cat": 0.1300286501646042,
  "lynx": 0.023919589817523956,
  "bucket": 0.011532178148627281
}

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

codecov · 2023-04-12T18:56:50Z

Codecov Report

Merging #2227 (5be2caa) into master (cf7544b) will not change coverage.
The diff coverage is n/a.

❗ Current head 5be2caa differs from pull request most recent head 130819e. Consider uploading reports for the commit 130819e to get more accurate results

@@           Coverage Diff           @@
##           master    #2227   +/-   ##
=======================================
  Coverage   71.47%   71.47%           
=======================================
  Files          73       73           
  Lines        3341     3341           
  Branches       57       57           
=======================================
  Hits         2388     2388           
  Misses        950      950           
  Partials        3        3


msaroufim · 2023-04-12T22:47:44Z

Could you please run a few sample inferences on this image, for both an image model and a language model? I remember we moved to the dev image in the first place because some things were broken in prod; I just can't remember what exactly.

EDIT: I'd also like to see the CUDA image explicitly tested

fabridamicelli · 2023-04-14T10:27:04Z

> For this image could you please run a few sample inferences on it for an image and a language model? I remember we moved to the dev image in the first place because some stuff was broken for prod just can't remember what exactly
>
> EDIT: I'd also like to see the CUDA image explicitly tested

A side comment, if I may:
For the image model (on CPU), this test in CI now covers it.
I mention it because that might be a good place to extend the test with a language model example (at least for the CPU case).

agunapal · 2023-04-19T20:24:45Z

@fabridamicelli I have attached the logs of a test case with the ResNet and HF BERT models.
I did not want to copy-paste the code and extend the script.
I think we should extend the test cases in a more organized effort and avoid code duplication. I am looking forward to your design suggestion for how we want to do this.

fabridamicelli · 2023-04-20T14:07:10Z

@agunapal

> @fabridamicelli I have attached the logs of a test-case with ResNet, HF BERT models. I did not want to copy paste the code and extend the script. I think we should extend the testcases in a more organized effort and avoid code duplication. I am looking forward to your design suggestion for how we want to do this.

Thanks for the update!
That makes sense. I am already prototyping a small testing scaffold for the Docker-based examples (built with pytest) to avoid some of this manual work. I'll open it as a draft PR and ping you as soon as I have something worth looking at, so that we can discuss with something concrete at hand.

agunapal · 2023-04-20T16:47:21Z

@fabridamicelli Thanks for the update! There is no rush. Looking forward to it

Changed docker image to production

2056903

agunapal requested a review from msaroufim April 12, 2023 18:35

agunapal requested a review from namannandan April 12, 2023 19:06

namannandan approved these changes Apr 12, 2023


Merge branch 'master' into feature/use_docker_prod

1ee80ac

Merge branch 'master' into feature/use_docker_prod

130819e

agunapal requested a review from lxning April 19, 2023 21:56

lxning approved these changes Apr 20, 2023


agunapal merged commit dd8e792 into master Apr 20, 2023

agunapal deleted the feature/use_docker_prod branch April 20, 2023 16:48
