Skip to content

Latest commit

 

History

History
254 lines (181 loc) · 12.1 KB

README.md

File metadata and controls

254 lines (181 loc) · 12.1 KB

Screenshot 2022-10-19 at 3 55 14 PM

Accelerate your machine learning and deep learning models by upto 10X

🔥UPDATE: Stable-Diffusion/DreamBooth Acceleration. Upto 2.5X speed up in inference🔥

voltaML is an open-source lightweight library to accelerate your machine learning and deep learning models. VoltaML can optimize, compile and deploy your models to your target CPU and GPU devices, with just one line of code.

animated

Out of the box support for

✅ FP16 Quantization

✅ Int8 Quantization*

✅ Hardware specific compilation


Screenshot 2022-10-17 at 12 06 26 PM


voltaML has compilation support for the following:

Screenshot 2022-06-13 at 3 43 03 PM

Installation

Own setup:

Requirements:

  • CUDA Version >11.x
  • TensorRT == 8.4.1.2
  • PyTorch == 1.12 cu11.x
  • NVIDIA Driver version > 510
git clone https://github.com/VoltaML/voltaML.git
cd voltaML
python setup.py install

Docker Container 🐳

docker pull voltaml/voltaml:v0.4
docker run -it --gpus=all -p "8888:8888" voltaml/voltaml:v0.4 \ 
        jupyter lab --port=8888 --no-browser --ip 0.0.0.0 --allow-root

Usage

import torch
from voltaml.compile import VoltaGPUCompiler, VoltaCPUCompiler, TVMCompiler
from voltaml.inference import gpu_performance

model = torch.load("path/to/model/dir")

# compile the model by giving paths
compiler = VoltaGPUCompiler(
        model=model,
        output_dir="destination/path/of/compiled/model",
        input_shape=(1, 3, 224, 224), # example input shape
        precision="fp16" # specify precision[fp32, fp16, int8] - Only for GPU compiler
        target="llvm" # specify target device - Only for TVM compiler
    )

# returns the compiled model
compiled_model = compiler.compile()

# compute and compare performance
gpu_performance(compiled_model, model, input_shape=(1, 3, 224, 224))
cpu_performance(compiled_model, model, compiler="voltaml", input_shape=(1, 3, 224, 224))
cpu_performance(compiled_model, model, compiler="tvm", input_shape=(1, 3, 224, 224))

Notebooks

  1. ResNet-50 image classification
  2. DeeplabV3_MobileNet_v3_Large Segmentation
  3. YOLOv5 Object Detection YOLOv5
  4. YOLOv6 Object Detection YOLOv6
  5. Bert_Base_Uncased Huggingface

Benchmarks

🖼️ Classification Models Inference Latency (on GPU) ⏱️

Classification has been done on Imagenet data, batch size = 1 and imagesize = 224 on NVIDIA RTX 2080Ti. In terms of top 1% and 5% accuracy for int8 models, we have not seen an accuracy drop of more than 1%.

Pytorch (ms), VoltaGPU FP16 (ms) and VoltaGPU int8 (ms)

Model Pytorch (ms) VoltaGPU FP16 (ms) VoltaGPU int8 (ms) Pytorch vs Int8 Speed
squeezenet1_1 1.6 0.2 0.2 8.4x
resnet18 2.7 0.4 0.3 9.0x
resnet34 4.5 0.7 0.5 9.0x
resnet50 6.6 0.7 0.5 13.2x
resnet101 13.6 1.3 1.0 13.6x
densenet121 15.7 2.4 2.0 7.9x
densenet169 22.0 4.4 3.8 5.8x
densenet201 26.8 6.3 5.0 5.4x
vgg11 2.0 0.9 0.5 4.0x
vgg16 3.5 1.2 0.7 5.0x

🧐 Object Detection (YOLO) Models Inference Latency (on GPU) ⏱️

Object Detection inference was done on a dummy data with imagesize = 640 and batch size = 1 on NVIDIA RTX 2080Ti.

Pytorch (ms) and VoltaGPU FP16 (ms)

Model Pytorch (ms) VoltaGPU FP16 (ms) Pytorch vs FP16 Speed
YOLOv5n 5.2 1.2 4.3x
YOLOv5s 5.1 1.6 3.2x
YOLOv5m 9.1 3.2 2.8x
YOLOv5l 15.3 5.1 3.0x
YOLOv5x 30.8 6.4 4.8x
YOLOv6s 8.8 3.0 2.9x
YOLOv6l_relu 23.4 5.5 4.3x
YOLOv6l 18.1 4.1 4.4x
YOLOv6n 9.1 1.6 5.7x
YOLOv6t 8.6 2.4 3.6x
YOLOv5m 15.5 3.5 4.4x

🎨 Segmentation Models Inference Latency (on GPU) ⏱️

Segmentation inference was done on a dummy data with imagesize = 224 and batch size = 1 on NVIDIA RTX 2080Ti.

Pytorch (ms), VoltaGPU FP16 (ms) and VoltaGPU Int8 (ms)(1)

Model Pytorch (ms) VoltaGPU FP16 (ms) VoltaGPU Int8 (ms) Speed Up (X)
FCN_Resnet50 8.3 2.3 1.8 3.6x
FCN_Resnet101 14.7 3.5 2.5 5.9x
DeeplabV3_Resnet50 12.1 2.5 1.3 9.3x
DeeplabV3_Resnet101 18.7 3.6 2.0 9.4x
DeeplabV3_MobileNetV3_Large 6.1 1.5 0.8 7.6x
DeeplabV3Plus_ResNet50 6.1 1.1 0.8 7.6x
DeeplabV3Plus_ResNet34 4.7 0.9 0.8 5.9x
UNet_ResNet50 6.2 1.3 1 6.2x
UNet_ResNet34 4.3 1.1 0.8 5.4x
FPN_ResNet50 5.5 1.2 1 5.5x
FPN_ResNet34 4.2 1.1 1 4.2x

🤗 Accelerating Huggingface Models using voltaML

We're adding support to accelerate Huggingface NLP models with voltaML. This work has been inspired from ELS-RD's work. This is still in the early stages and only few models listed in the below table are supported. We're working to add more models soon.

from voltaml.compile import VoltaNLPCompile
from voltaml.inference import nlp_performance


model='bert-base-cased'
backend=["tensorrt","onnx"] 
seq_len=[1, 1, 1] 
task="classification"
batch_size=[1,1,1]

VoltaNLPCompile(model=model, device='cuda', backend=backend, seq_len=seq_len)

nlp_performance(model=model, device='cuda', backend=backend, seq_len=seq_len)

Pytorch (ms) and VoltaML FP16 (ms)

Model Pytorch (ms) VoltaML FP16 (ms) SpeedUp
bert-base-uncased 6.4 1 6.4x
Jean-Baptiste/camembert-ner 6.3 1 6.3x
gpt2 6.6 1.2 5.5x
xlm-roberta-base 6.4 1.08 5.9x
roberta-base 6.6 1.09 6.1x
bert-base-cased 6.2 0.9 6.9x
distilbert-base-uncased 3.5 0.6 5.8x
roberta-large 11.9 2.4 5.0x
deepset/xlm-roberta-base-squad2 6.2 1.08 5.7x
cardiffnlp/twitter-roberta-base-sentiment 6 1.07 5.6x
sentence-transformers/all-MiniLM-L6-v2 3.2 0.42 7.6x
bert-base-chinese 6.3 0.97 6.5x
distilbert-base-uncased-finetuned-sst-2-english 3.4 0.6 5.7x
albert-base-v2 6.7 1 6.7x

voltaTrees ⚡🌴 -> Link

A LLVM-based compiler for XGBoost and LightGBM decision trees.

voltatrees converts trained XGBoost and LightGBM models to optimized machine code, speeding-up prediction by ≥10x.

Example

import voltatrees as vt

model = vt.XGBoostRegressor.Model(model_file="NYC_taxi/model.txt")
model.compile()
model.predict(df)

Installation

git clone git clone https://github.com/VoltaML/volta-trees.git
cd volta-trees/
pip install -e .

Benchmarks

On smaller datasets, voltaTrees is 2-3X faster than Treelite by DMLC. Testing on large scale dataset is yet to be conducted.

Enterpise Platform 🛣️

Any enterprise customers who would like a fully managed solution hosted on your own cloud, please contact us at harish@voltaml.com

  • Fully managed and cloud-hosted optimization engine.
  • Hardware targeted optimised dockers for maximum performance.
  • One-click deployment of the compiled models.
  • Cost-benefit analysis dashboard for optimal deployment.
  • NVIDIA Triton optimzed dockers for large-scale GPU deployment.
  • Quantization-Aware-Training (QAT)