TorchServe v0.10.0 Release Notes
This is the release of TorchServe v0.10.0.
Highlights include:
- Extended support for PyTorch 2.x inference
- C++ backend
- GenAI fast series torch.compile showcase examples
- Token authentication support for enhanced security
C++ Backend
TorchServe presented the experimental C++ backend at the PyTorch Conference 2022. Like the Python backend, the C++ backend runs as a process and uses a BaseHandler to define the APIs for customizing the handler. With a backend and handler written in pure C++, it is now possible to deploy PyTorch models without any Python overhead. This release officially promotes the experimental C++ backend branch to master and includes additional examples and Docker images for development.
- Refactored C++ backend branch and promoted it to master #2840 #2927 #2937 #2953 #2975 #2980 #2958 #3006 #3012 #3014 #3018 @mreso
- C++ backend examples:
a. Example Baby Llama #2903 #2911 @shrinath-suresh @mreso
b. Example Llama2 #2904 @shrinath-suresh @mreso
- C++ dev Docker for CPU and GPU #2976 #3015 @namannandan
torch.compile
With the launch of PT2 inference at the PyTorch Conference 2023, we have added several key examples showcasing out-of-the-box speedups with torch.compile and AOT compile. Since there is no new development being done in TorchScript, starting with this release, TorchServe is preparing the migration path for customers to switch from TorchScript to torch.compile.
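As a rough sketch of that migration path (not a TorchServe handler; the ResNet-18 model and input shape are arbitrary placeholders), the same eager model that was previously exported with TorchScript can instead be wrapped with torch.compile:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# Legacy path: TorchScript export
scripted = torch.jit.script(model)

# PT2 path: torch.compile with the default inductor backend
compiled = torch.compile(model, backend="inductor")

with torch.inference_mode():
    out = compiled(example)  # first call triggers compilation
```

In TorchServe itself, torch.compile is typically enabled through the model config YAML consumed by the BaseHandler (see #2796) rather than called directly in handler code.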
GenAI torch.compile series
The GenAI fast series models - GPTFast, SegmentAnythingFast, DiffusionFast - deliver 3-10x speedups using torch.compile and native PyTorch optimizations:
- Example GPT Fast #2815 #2834 #2935 @mreso and deployment with KServe #2966 #2895 @agunapal
- Example Segment Anything Fast #2802 @agunapal
- Example Diffusion Fast #2902 @agunapal
Cold start problem solution
To address cold start problems, an example is included showing how torch._export.aot_load (an experimental API) can be used to load a pre-compiled model. TorchServe has also started benchmarking models with torch.compile and tracking their performance compared to TorchScript.
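A minimal sketch of how these experimental APIs fit together, assuming PyTorch 2.2 (the ResNet-18 model and input shape are placeholders; both APIs are private/experimental and may change):

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Ahead of time (offline): compile the model into a shared library.
so_path = torch._export.aot_compile(model, example_inputs)

# At serve time: load the pre-compiled artifact instead of compiling on
# the first request, avoiding the cold-start penalty.
runner = torch._export.aot_load(so_path, device="cpu")

with torch.inference_mode():
    output = runner(*example_inputs)
```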
The new TorchServe C++ backend also includes torch.compile and AOTInductor related examples for ResNet50, BERT and Llama2.
- torch.compile
  a. Example torch.compile with image classifier model densenet161 #2915 @agunapal
  b. Example torch._export.aot_compile with image classification model ResNet-18 #2832 #2906 #2932 #2948 @agunapal
  c. Example torch inductor fx graph caching with image classification model densenet161 #2925 @agunapal
- C++ AOTInductor
  a. Example AOT Inductor with Llama2 #2913 @mreso
  b. Example AOT Inductor with ResNet-50 #2944 @lxning
  c. Example AOT Inductor with BERTSequenceClassification #2931 @lxning
Gen AI
- Supported sequence batching for stateful inference in gRPC bi-directional streaming #2513 @lxning
- The fast series GenAI models using torch.compile and native PyTorch optimizations (see the GenAI torch.compile series above)
- Example Mistral 7B with vLLM #2781 @agunapal
- Example PyTorch native tensor parallel with Llama2 with continuous batching #2709 @mreso @HamidShojanazeri
- Supported inf2 Neuronx transformer continuous batching for both no-code and advanced users, with a Llama2-70B example #2803 #3016 @lxning
- Example deepspeed mii fastgen with Llama2-13B #2779 @lxning
Security
TorchServe has implemented token authentication for the management and inference APIs. This is an optional config and can be enabled using the torchserve-endpoint-plugin, which can be downloaded from Maven. This further strengthens TorchServe’s capability as a secure model serving solution. The security features of TorchServe are documented here.
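For illustration, a request against a token-protected inference endpoint might look like the following; the model name, key value, and input file are placeholders, and the bearer-token header format is an assumption based on TorchServe's token authorization documentation, so consult the plugin docs for the exact scheme:

```python
import requests

# Placeholder: the inference key generated when token authentication is enabled
INFERENCE_KEY = "<key-from-generated-key-file>"

response = requests.post(
    "http://localhost:8080/predictions/my_model",        # hypothetical model name
    headers={"Authorization": f"Bearer {INFERENCE_KEY}"},  # assumed header format
    data=open("sample_input.txt", "rb"),                  # hypothetical input payload
)
print(response.status_code, response.text)
```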
Apple Silicon Support
TorchServe is now supported on Apple Silicon Macs. The current support is CPU only. We have also posted an RFC for the deprecation of x86 Mac support.
- Include arm64 mac in CI workflows #2934 @udaij12
- Conda binaries build support #3013 @udaij12
- Adding support for regression tests for binaries #3019 @udaij12
KServe Updates
While serving large models, model loading can take some time even though the pod is running. Although TorchServe is up, the worker is not ready until the model is loaded. To address this, TorchServe now sets the model ready status in KServe only after the model has been loaded on the workers. TorchServe also includes native open inference protocol support in gRPC; this is an experimental feature.
- Supported native KServe open inference protocol in gRPC #2609 @andyi2it
- Refactored TorchServe configuration in KServe #2995 @sgaist
- Improved KServe protocol version handling #2957 @sgaist
- Updated KServe test script to return model version #2973 @agunapal
- Set model status using TorchServe API in KServe #1878 @byeongjokim
- Supported no-archive model archiver in KServe #2839 @agunapal
- How to deploy MNIST using KServe with minikube #2718 @agunapal
Metrics Updates
To extend backwards compatibility support for metrics, auto-detection of backend metrics now provides the flexibility to publish custom model metrics without explicitly specifying them in the metrics configuration file. A customized script to collect system metrics is also now supported.
- Supported backend metrics auto-detection #2769 @namannandan
- Fixed backend metrics backward compatibility #2816 @namannandan
- Supported customized system metrics script via config.properties #3000 @lxning
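As a minimal sketch of what auto-detection enables, a custom handler (hypothetical; add_counter is part of TorchServe's custom metrics API) can emit a backend metric that is picked up without being pre-declared in the metrics configuration file:

```python
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    """Hypothetical custom handler that emits a custom backend metric."""

    def postprocess(self, data):
        # With auto-detection, this metric no longer needs to be listed
        # in the metrics configuration file ahead of time.
        self.context.metrics.add_counter("PostprocessCallCount", 1)
        return data.tolist()
```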
Improvements and Bug Fixing
- Supported PyTorch 2.2.1 #2959 #2972 and Release version updated #3010 @agunapal
- Enabled option to install a model's third-party dependencies in a Python virtual environment via the model config YAML file #2910 #2946 #2954 @namannandan
- Fixed worker auto recovery #2746 @mreso
- Fixed incomplete write/flush in worker thread #2833 @lxning
- Fixed the priority of parameters defined in register curl vs model-config.yaml #2858 @lxning
- Refactored sanity check with pytest #2221 @mreso
- Fixed model state if runtime is null from model archiver #2928 @mreso
- Refactored benchmark script for LLM benchmark integration #2897 @mreso
- Added pytest for tensor parallel #2741 @mreso
- Fixed continuous batching unit test #2847 @mreso
- Added separate pytest for send_intermediate_prediction_response #2896 @mreso
- Fixed GPU ID in GPT Fast handler #2872 @sachanub
- Added model archiver API #2751 @GeeCastro
- Updated torch.compile in BaseHandler to accept kwargs via model config yaml file #2796 @eballesteros
- Integrated pytorch-probot into TorchServe #2725 @atalman
- Added queue time in benchmark report #2854 @sachanub
- Replaced no_grad with inference_mode in BaseHandler #2804 @bryant1410
- Fixed env var CUDA_VERSION conflict in Dockerfile #2807 @rsbowman-striveworks
- Fixed var USE_CUDA_VERSION in Dockerfile #2982 @fyang93
- Fixed BASE_IMAGE for k8s docker image #2808 @rsbowman-striveworks
- Fixed workflow store path in config.properties overwritten by the default workflow path #2792 @udaij12
- Removed invalid warning log #2867 @lxning
- Updated PyTorch nightly url and CPU version in install_dependency.py #2971 #3011 @agunapal
- Deprecated Dockerfile.dev, build dev and prod docker image from single source Dockerfile #2782 @sachanub
- Updated transformers version to >= 4.34.0 #2703 @agunapal
- Fixed Neuronx requirements #2887 #2900 @namannandan
- Added neuron SDK installation in install_dependencies.py #2893 @mreso
- Updated ResNet-152 example output #2745 @sachanub
- Clarified that "Not Accepted" is a valid classification in Huggingface_Transformers Sequence Classification example #2786 @nathanweeks
- Added dead link checking in md files #2984 @mreso
- Added comments in model_service_worker.py #2809 @InakiRaba91
- Added new GitHub workflows and updated existing workflows #2726 #2732 #2737 #2734 #2750 #2767 #2778 #2792 #2835 #2846 #2848 #2855 #2856 #2859 #2864 #2863 #2891 #2938 #2939 #2961 #2960 #2964 #3009 @agunapal @udaij12 @namannandan @sachanub
Documentation
- Updated security readme #2773 #3020 @agunapal @udaij12
- Added security readme to TorchServe site #2784 @sekyondaMeta
- Refactored the README.md #2729 @chauhang
- Updated git clone instruction in gRPC api documentation #2799 @bryant1410
- Highlighted code in README #2805 @bryant1410
- Fixed typos in the README.md #2806 #2871 @bryant1410 @rafijacSense
- Fixed dead links in documentation #2936 @agunapal
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |