Change docker image to production #2227

Merged
agunapal merged 3 commits into master from feature/use_docker_prod on Apr 20, 2023

Conversation

agunapal
Collaborator

@agunapal agunapal commented Apr 12, 2023

Description

We have been pushing the "dev" Docker image for the last few releases.

  • Change the published image to the production build to reduce the image size
  • CPU images shrink from 3.84 GB to 2.17 GB
  • GPU images shrink from 11.9 GB to 8.57 GB
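
As a quick check on the figures above, the relative savings can be worked out directly. This is a minimal sketch: `percent_reduction` is a hypothetical helper name, and the inputs are just the sizes quoted in this description.

```python
def percent_reduction(before_gb: float, after_gb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return (before_gb - after_gb) / before_gb * 100.0


# Image sizes quoted in the PR description, in GB.
cpu = percent_reduction(3.84, 2.17)
gpu = percent_reduction(11.9, 8.57)

print(f"CPU image: {cpu:.1f}% smaller")  # CPU image: 43.5% smaller
print(f"GPU image: {gpu:.1f}% smaller")  # GPU image: 28.0% smaller
```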

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -t pytorch/torchserve:latest
[+] Building 166.5s (22/22) FINISHED                                                                                                                                            
 => [internal] load build definition from Dockerfile                                                                                                                       0.0s
 => => transferring dockerfile: 5.61kB                                                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                                                          0.0s
 => => transferring context: 2B                                                                                                                                            0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                      0.6s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                 0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                                                                            0.0s
 => [runtime-image 1/9] FROM docker.io/library/ubuntu:20.04                                                                                                                0.0s
 => [internal] load build context                                                                                                                                          0.0s
 => => transferring context: 80B                                                                                                                                           0.0s
 => [runtime-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y   143.1s
 => [compile-image 2/7] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties  107.4s
 => [compile-image 3/7] RUN python3.9 -m venv /home/venv                                                                                                                   3.4s
 => [compile-image 4/7] RUN python -m pip install -U pip setuptools                                                                                                        3.6s
 => [compile-image 5/7] RUN export USE_CUDA=1                                                                                                                              0.6s
 => [compile-image 6/7] RUN TORCH_VER=$(curl --silent --location https://pypi.org/pypi/torch/json | python -c "import sys, json, pkg_resources; releases = json.load(sys  29.9s
 => [runtime-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                 0.6s
 => [compile-image 7/7] RUN python -m pip install --no-cache-dir captum torchtext torchserve torch-model-archiver pyyaml                                                   5.9s
 => [runtime-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                               4.9s 
 => [runtime-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                    0.0s 
 => [runtime-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                  0.3s 
 => [runtime-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                        0.0s 
 => [runtime-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                   0.4s 
 => [runtime-image 9/9] WORKDIR /home/model-server                                                                                                                         0.0s 
 => exporting to image                                                                                                                                                     5.7s 
 => => exporting layers                                                                                                                                                    5.7s 
 => => writing image sha256:fd43886d6e11d3cc23b402c190909f732a147fb3c251bd93169db0a7ae9f733f                                                                               0.0s 
 => => naming to docker.io/pytorch/torchserve:latest                                                                                                                       0.0s
pytorch/torchserve-dev   latest                     ecd189c65df1   27 seconds ago   3.84GB
pytorch/torchserve       latest                     fd43886d6e11   10 minutes ago   2.17GB

Test cases

CPU

  • BERT
~/serve/examples/Huggingface_Transformers$ python Download_Transformer_models.py
Transformers version 4.11.0
Download model and tokenizer bert-base-uncased
...
Successfully created directory ./Transformer_model 
Save model and tokenizer/ Torchscript model based on the setting from setup_config bert-base-uncased in directory ./Transformer_model
~/serve/examples/Huggingface_Transformers$ docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T20:01:13,538 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T20:01:13,642 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1

Install transformers in docker container
docker exec -it 3025 /bin/bash
model-server@3025a0fd0bfa:~$ pip install transformers
Collecting transformers

curl -X POST "localhost:8081/models?model_name=my_tc&url=BERTSeqClassification.mar&initial_workers=1"
{
  "status": "Model \"my_tc\" Version: 1.0 registered with 1 initial workers"
}
 curl -X POST http://127.0.0.1:8080/predictions/my_tc -T Seq_classification_artifacts/sample_text_captum_input.txt
Not Accepted
  • ResNet18
wget https://download.pytorch.org/models/resnet18-f37072fd.pth
torch-model-archiver --model-name resnet-18 --version 1.0 --model-file ./examples/image_classifier/resnet_18/model.py --serialized-file resnet18-f37072fd.pth --handler image_classifier --extra-files ./examples/image_classifier/index_to_name.json
mkdir model_store
mv resnet-18.mar model_store/
docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T20:18:29,197 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T20:18:29,317 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1

curl -X POST "localhost:8081/models?model_name=resnet-18&url=resnet-18.mar&initial_workers=1"
{
  "status": "Model \"resnet-18\" Version: 1.0 registered with 1 initial workers"
}

curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
{
  "tabby": 0.40966305136680603,
  "tiger_cat": 0.34670504927635193,
  "Egyptian_cat": 0.1300286501646042,
  "lynx": 0.023919589817523956,
  "bucket": 0.011532178148627281
}
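
The register-then-predict flow above could be scripted instead of typed by hand. The sketch below only builds the registration URL and reads the top label out of a response; it makes no network calls. `registration_url` and `top_prediction` are hypothetical helper names, and the host and port are assumed to match the `-p 8081:8081` mapping used in the `docker run` command.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode

# Management API port, as mapped by `docker run -p 8081:8081` (assumption).
MANAGEMENT_API = "http://localhost:8081"


def registration_url(model_name: str, mar_file: str, initial_workers: int = 1) -> str:
    """Build the management-API URL that registers a model archive."""
    query = urlencode(
        {"model_name": model_name, "url": mar_file, "initial_workers": initial_workers}
    )
    return f"{MANAGEMENT_API}/models?{query}"


def top_prediction(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Response copied from the ResNet18 run above.
response = {
    "tabby": 0.40966305136680603,
    "tiger_cat": 0.34670504927635193,
    "Egyptian_cat": 0.1300286501646042,
    "lynx": 0.023919589817523956,
    "bucket": 0.011532178148627281,
}

print(registration_url("resnet-18", "resnet-18.mar"))
print(top_prediction(response))
```

The URL produced here is the same one passed to `curl -X POST` in the log; an HTTP client of your choice would send the POST.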

GPU

  • BERT
~/serve/examples/Huggingface_Transformers$ python Download_Transformer_models.py
Transformers version 4.11.0
Download model and tokenizer bert-base-uncased
...
Successfully created directory ./Transformer_model 
Save model and tokenizer/ Torchscript model based on the setting from setup_config bert-base-uncased in directory ./Transformer_model
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T21:50:55,529 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T21:50:55,627 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 8


Install transformers in docker container
docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED          STATUS          PORTS                                                                          NAMES
7306991cd501   pytorch/torchserve:latest-gpu   "/usr/local/bin/dock…"   29 seconds ago   Up 28 seconds   7070-7071/tcp, 0.0.0.0:8080-8082->8080-8082/tcp, :::8080-8082->8080-8082/tcp   silly_jackson

docker exec -it 7306 /bin/bash
model-server@7306991cd501:~$ pip install transformers
Collecting transformers


curl -X POST "localhost:8081/models?model_name=my_tc&url=BERTSeqClassification.mar&initial_workers=1"
{
  "status": "Model \"my_tc\" Version: 1.0 registered with 1 initial workers"
}
 curl -X POST http://127.0.0.1:8080/predictions/my_tc -T Seq_classification_artifacts/sample_text_captum_input.txt
Not Accepted
  • ResNet18
wget https://download.pytorch.org/models/resnet18-f37072fd.pth
torch-model-archiver --model-name resnet-18 --version 1.0 --model-file ./examples/image_classifier/resnet_18/model.py --serialized-file resnet18-f37072fd.pth --handler image_classifier --extra-files ./examples/image_classifier/index_to_name.json
mkdir model_store
mv resnet-18.mar model_store/
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-04-19T21:46:34,986 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-04-19T21:46:35,151 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.7.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 8


curl -X POST "localhost:8081/models?model_name=resnet-18&url=resnet-18.mar&initial_workers=1"
{
  "status": "Model \"resnet-18\" Version: 1.0 registered with 1 initial workers"
}

curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
{
  "tabby": 0.40966305136680603,
  "tiger_cat": 0.34670504927635193,
  "Egyptian_cat": 0.1300286501646042,
  "lynx": 0.023919589817523956,
  "bucket": 0.011532178148627281
}
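
The CPU and GPU runs in these logs return the same scores for `kitten.jpg`. If that comparison were automated, a tolerance-based check would probably be safer than exact equality, since floating-point results can differ slightly between devices. A minimal sketch, with `predictions_match` as a hypothetical helper and the tolerance chosen arbitrarily:

```python
import math
from typing import Dict


def predictions_match(a: Dict[str, float], b: Dict[str, float], rel_tol: float = 1e-3) -> bool:
    """True if both responses have the same labels and near-equal scores."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[label], b[label], rel_tol=rel_tol) for label in a)


# Responses copied from the CPU and GPU ResNet18 runs in this PR.
cpu_scores = {
    "tabby": 0.40966305136680603,
    "tiger_cat": 0.34670504927635193,
    "Egyptian_cat": 0.1300286501646042,
    "lynx": 0.023919589817523956,
    "bucket": 0.011532178148627281,
}
gpu_scores = dict(cpu_scores)  # the GPU log shows identical values

print(predictions_match(cpu_scores, gpu_scores))
```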

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal requested a review from msaroufim April 12, 2023 18:35
@codecov

codecov bot commented Apr 12, 2023

Codecov Report

Merging #2227 (5be2caa) into master (cf7544b) will not change coverage.
The diff coverage is n/a.

❗ Current head 5be2caa differs from pull request most recent head 130819e. Consider uploading reports for the commit 130819e to get more accurate results

@@           Coverage Diff           @@
##           master    #2227   +/-   ##
=======================================
  Coverage   71.47%   71.47%           
=======================================
  Files          73       73           
  Lines        3341     3341           
  Branches       57       57           
=======================================
  Hits         2388     2388           
  Misses        950      950           
  Partials        3        3           

@msaroufim
Member

msaroufim commented Apr 12, 2023

For this image, could you please run a few sample inferences on it for an image model and a language model? I remember we moved to the dev image in the first place because some stuff was broken for prod; I just can't remember what exactly.

EDIT: I'd also like to see the CUDA image explicitly tested

@fabridamicelli
Contributor

For this image, could you please run a few sample inferences on it for an image model and a language model? I remember we moved to the dev image in the first place because some stuff was broken for prod; I just can't remember what exactly.

EDIT: I'd also like to see the CUDA image explicitly tested

A side comment if I may:
For the image model (on CPU), this test in CI now covers it.
I mention it only because that might be a good place to extend the test with a language model example (at least for the CPU case).

@agunapal
Collaborator Author

@fabridamicelli I have attached the logs of a test-case with ResNet, HF BERT models.
I did not want to copy paste the code and extend the script.
I think we should extend the testcases in a more organized effort and avoid code duplication. I am looking forward to your design suggestion for how we want to do this.

@agunapal agunapal requested a review from lxning April 19, 2023 21:56
@fabridamicelli
Contributor

@agunapal

@fabridamicelli I have attached the logs of a test-case with ResNet, HF BERT models. I did not want to copy paste the code and extend the script. I think we should extend the testcases in a more organized effort and avoid code duplication. I am looking forward to your design suggestion for how we want to do this.

Thanks for the update!
That makes sense. I am already prototyping a small testing scaffold for the Docker-based examples (built with pytest) to avoid some of this manual work. I'll put it up as a draft PR and ping you as soon as I have something worth looking at, so that we can discuss it with something concrete at hand.
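
To make the idea concrete, here is one possible shape for such a scaffold. It is purely illustrative and is not the design discussed in this thread: `docker_run_command`, `IMAGES`, and the test name are all hypothetical, and `-d` (detached) replaces the `-it` used in the manual runs on the assumption that an automated test has no terminal attached.

```python
from typing import List

# Image tags taken from the manual tests in this PR, with whether each needs a GPU.
IMAGES = [
    ("pytorch/torchserve:latest", False),
    ("pytorch/torchserve:latest-gpu", True),
]


def docker_run_command(image: str, model_store: str, gpu: bool = False) -> List[str]:
    """Build the `docker run` argv mirroring the commands used in this PR."""
    cmd = ["docker", "run", "--rm", "-d"]
    if gpu:
        cmd += ["--gpus", "all"]
    for port in (8080, 8081, 8082):
        cmd += ["-p", f"{port}:{port}"]
    cmd += ["-v", f"{model_store}:/home/model-server/model-store", image]
    return cmd


def test_gpu_flag_only_for_gpu_image() -> None:
    """pytest would collect this function; it also runs as a plain call."""
    for image, gpu in IMAGES:
        cmd = docker_run_command(image, "/tmp/model_store", gpu=gpu)
        assert ("--gpus" in cmd) == gpu
        assert cmd[-1] == image


test_gpu_flag_only_for_gpu_image()
```

Keeping command construction in a pure function means the CPU and GPU cases share one code path, and the construction logic can be tested without starting a container.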

@agunapal
Collaborator Author

@fabridamicelli Thanks for the update! There is no rush. Looking forward to it

@agunapal agunapal merged commit dd8e792 into master Apr 20, 2023
@agunapal agunapal deleted the feature/use_docker_prod branch April 20, 2023 16:48