TorchServe v0.10.0 Release Notes
This is the release of TorchServe v0.10.0.
Highlights include:
- Extended support for PyTorch 2.x inference
- C++ backend
- GenAI fast series torch.compile showcase examples
- Token authentication support for enhanced security
C++ Backend
TorchServe presented the experimental C++ backend at the PyTorch Conference 2022. Like the Python backend, the C++ backend runs as a process and uses a BaseHandler to define the APIs for customizing the handler. With a backend and handler written in pure C++, it is now possible to deploy PyTorch models without any Python overhead. This release officially promotes the experimental C++ backend branch to master and includes additional examples and Docker images for development.
- Refactored C++ backend branch and promoted it to master #2840 #2927 #2937 #2953 #2975 #2980 #2958 #3006 #3012 #3014 #3018 @mreso
- C++ backend examples:
a. Example Baby Llama #2903 #2911 @shrinath-suresh @mreso
b. Example Llama2 #2904 @shrinath-suresh @mreso
- C++ dev Docker for CPU and GPU #2976 #3015 @namannandan
torch.compile
With the launch of PT2 inference at the PyTorch Conference 2023, we have added several key examples showcasing out-of-the-box speedups with torch.compile and AOT compile. Since there is no new development being done in TorchScript, starting with this release, TorchServe is preparing the migration path for customers to switch from TorchScript to torch.compile.
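As a rough sketch of that migration path (not a TorchServe handler; the ResNet-18 model and input shape are arbitrary placeholders), the same eager model that was previously exported with TorchScript can instead be wrapped with torch.compile:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# Legacy path: TorchScript export
scripted = torch.jit.script(model)

# PT2 path: torch.compile with the default inductor backend
compiled = torch.compile(model, backend="inductor")

with torch.inference_mode():
    out = compiled(example)  # first call triggers compilation
```

In TorchServe itself, torch.compile is typically enabled through the model config YAML consumed by the BaseHandler (see #2796) rather than called directly in handler code.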
GenAI torch.compile series
The GenAI fast series models - GPTFast, SegmentAnythingFast, DiffusionFast - deliver 3-10x speedups using torch.compile and native PyTorch optimizations:
- Example GPT Fast #2815 #2834 #2935 @mreso and deployment with KServe #2966 #2895 @agunapal
- Example Segment Anything Fast #2802 @agunapal
- Example Diffusion Fast #2902 @agunapal
Cold start problem solution
To address cold start problems, an example is included showing how torch._export.aot_load (an experimental API) can be used to load a pre-compiled model. TorchServe has also started benchmarking models with torch.compile and tracking their performance compared to TorchScript.
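A minimal sketch of how these experimental APIs fit together, assuming PyTorch 2.2 (the ResNet-18 model and input shape are placeholders; both APIs are private/experimental and may change):

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Ahead of time (offline): compile the model into a shared library.
so_path = torch._export.aot_compile(model, example_inputs)

# At serve time: load the pre-compiled artifact instead of compiling on
# the first request, avoiding the cold-start penalty.
runner = torch._export.aot_load(so_path, device="cpu")

with torch.inference_mode():
    output = runner(*example_inputs)
```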
The new TorchServe C++ backend also includes torch.compile and AOTInductor related examples for ResNet50, BERT and Llama2.
- torch.compile
  a. Example torch.compile with image classifier model densenet161 #2915 @agunapal
  b. Example torch._export.aot_compile with image classification model ResNet-18 #2832 #2906 #2932 #2948 @agunapal
  c. Example torch inductor fx graph caching with image classification model densenet161 #2925 @agunapal
- C++ AOTInductor
  a. Example AOT Inductor with Llama2 #2913 @mreso
  b. Example AOT Inductor with ResNet-50 #2944 @lxning
  c. Example AOT Inductor with BERTSequenceClassification #2931 @lxning
Gen AI
- Supported sequence batching for stateful inference in gRPC bi-directional streaming #2513 @lxning
- The fast series GenAI models using torch.compile and native PyTorch optimizations (see the GenAI torch.compile series above)
- Example Mistral 7B with vLLM #2781 @agunapal
- Example PyTorch native tensor parallel with Llama2 with continuous batching #2709 @mreso @HamidShojanazeri
- Supported inf2 Neuronx transformer continuous batching for both no-code and advanced users, with a Llama2-70B example #2803 #3016 @lxning
- Example deepspeed mii fastgen with Llama2-13B #2779 @lxning
Security
TorchServe has implemented token authentication for the management and inference APIs. This is an optional config and can be enabled using the torchserve-endpoint-plugin, which can be downloaded from Maven. This further strengthens TorchServe’s capability as a secure model serving solution. The security features of TorchServe are documented here.
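For illustration, a request against a token-protected inference endpoint might look like the following; the model name, key value, and input file are placeholders, and the bearer-token header format is an assumption based on TorchServe's token authorization documentation, so consult the plugin docs for the exact scheme:

```python
import requests

# Placeholder: the inference key generated when token authentication is enabled
INFERENCE_KEY = "<key-from-generated-key-file>"

response = requests.post(
    "http://localhost:8080/predictions/my_model",        # hypothetical model name
    headers={"Authorization": f"Bearer {INFERENCE_KEY}"},  # assumed header format
    data=open("sample_input.txt", "rb"),                  # hypothetical input payload
)
print(response.status_code, response.text)
```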
Apple Silicon Support
TorchServe is now supported on Apple Silicon Macs. The current support is CPU only. We have also posted an RFC for the deprecation of x86 Mac support.
- Include arm64 mac in CI workflows #2934 @udaij12
- Conda binaries build support #3013 @udaij12
- Adding support for regression tests for binaries #3019 @udaij12
KServe Updates
While serving large models, model loading can take some time even though the pod is running. Although TorchServe is up, the worker is not ready until the model is loaded. To address this, TorchServe now sets the model ready status in KServe only after the model has been loaded on the workers. TorchServe also includes native open inference protocol support in gRPC; this is an experimental feature.
- Supported native KServe open inference protocol in gRPC #2609 @andyi2it
- Refactored TorchServe configuration in KServe #2995 @sgaist
- Improved KServe protocol version handling #2957 @sgaist
- Updated KServe test script to return model version #2973 @agunapal
- Set model status using TorchServe API in KServe #1878 @byeongjokim
- Supported no-archive model archiver in KServe #2839 @agunapal
- How to deploy MNIST using KServe with minikube #2718 @agunapal
Metrics Updates
To extend backwards compatibility support for metrics, auto-detection of backend metrics now provides the flexibility to publish custom model metrics without explicitly specifying them in the metrics configuration file. A customized script to collect system metrics is also now supported.
- Supported backend metrics auto-detection #2769 @namannandan
- Fixed backend metrics backward compatibility #2816 @namannandan
- Supported customized system metrics script via config.properties #3000 @lxning
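As a minimal sketch of what auto-detection enables, a custom handler (hypothetical; add_counter is part of TorchServe's custom metrics API) can emit a backend metric that is picked up without being pre-declared in the metrics configuration file:

```python
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    """Hypothetical custom handler that emits a custom backend metric."""

    def postprocess(self, data):
        # With auto-detection, this metric no longer needs to be listed
        # in the metrics configuration file ahead of time.
        self.context.metrics.add_counter("PostprocessCallCount", 1)
        return data.tolist()
```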
Improvements and Bug Fixing
- Supported PyTorch 2.2.1 #2959 #2972 and Release version updated #3010 @agunapal
- Enabled option to install a model's third-party dependencies in a Python virtual environment via the model config YAML file #2910 #2946 #2954 @namannandan
- Fixed worker auto recovery #2746 @mreso
- Fixed incomplete write/flush in worker thread #2833 @lxning
- Fixed the priority of parameters defined in register curl vs model-config.yaml #2858 @lxning
- Refactored sanity check with pytest #2221 @mreso
- Fixed model state if runtime is null from model archiver #2928 @mreso
- Refactored benchmark script for LLM benchmark integration #2897 @mreso
- Added pytest for tensor parallel #2741 @mreso
- Fixed continuous batching unit test #2847 @mreso
- Added separate pytest for send_intermediate_prediction_response #2896 @mreso
- Fixed GPU ID in GPT Fast handler #2872 @sachanub
- Added model archiver API #2751 @GeeCastro
- Updated torch.compile in BaseHandler to accept kwargs via model config yaml file #2796 @eballesteros
- Integrated pytorch-probot into TorchServe #2725 @atalman
- Added queue time in benchmark report #2854 @sachanub
- Replaced no_grad with inference_mode in BaseHandler #2804 @bryant1410
- Fixed env var CUDA_VERSION conflict in Dockerfile #2807 @rsbowman-striveworks
- Fixed var USE_CUDA_VERSION in Dockerfile #2982 @fyang93
- Fixed BASE_IMAGE for k8s docker image #2808 @rsbowman-striveworks
- Fixed workflow store path in config.properties overwritten by the default workflow path #2792 @udaij12
- Removed invalid warning log #2867 @lxning
- Updated PyTorch nightly url and CPU version in install_dependency.py #2971 #3011 @agunapal
- Deprecated Dockerfile.dev, build dev and prod docker image from single source Dockerfile #2782 @sachanub
- Updated transformers version to >= 4.34.0 #2703 @agunapal
- Fixed Neuronx requirements #2887 #2900 @namannandan
- Added neuron SDK installation in install_dependencies.py #2893 @mreso
- Updated ResNet-152 example output #2745 @sachanub
- Clarified that "Not Accepted" is a valid classification in Huggingface_Transformers Sequence Classification example #2786 @nathanweeks
- Added dead link checking in md files #2984 @mreso
- Added comments in model_service_worker.py #2809 @InakiRaba91
- Added new GitHub workflows and updated existing workflows #2726 #2732 #2737 #2734 #2750 #2767 #2778 #2792 #2835 #2846 #2848 #2855 #2856 #2859 #2864 #2863 #2891 #2938 #2939 #2961 #2960 #2964 #3009 @agunapal @udaij12 @namannandan @sachanub
Documentation
- Updated security readme #2773 #3020 @agunapal @udaij12
- Added security readme to TorchServe site #2784 @sekyondaMeta
- Refactored the README.md #2729 @chauhang
- Updated git clone instruction in gRPC api documentation #2799 @bryant1410
- Highlighted code in README #2805 @bryant1410
- Fixed typos in the README.md #2806 #2871 @bryant1410 @rafijacSense
- Fixed dead links in documentation #2936 @agunapal
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |