Releases: AmusementClub/vs-mlrt
v12: latest CUDA libraries
Compared to v11, this release updated CUDA dependencies to CUDA 11.8.0, cuDNN 8.6.0 and TensorRT 8.5.1:
- Added support for the NVIDIA 40 series GPUs.
- Added support for RIFE on the `trt` backend.
Known issues
- Performance of the `OV_CPU` or `ORT_CUDA(fp16=True)` backends for `RIFE` is lower than expected, which is under investigation. Please consider `ORT_CPU` or `ORT_CUDA(fp16=False)` for now.
- The `NCNN_VK` backend does not support `RIFE`.
Installation Notes
For some advanced features, `vsmlrt.py` requires the `numpy` and `onnx` packages to be available. You might need to run `pip install onnx numpy`.
Benchmark
Configuration: NVIDIA RTX 3090, driver 526.47, Windows Server 2019, vs r60, python 3.11.0, 1080p fp16
Backends: ort-cuda, trt from vs-mlrt v12.
For the `trt` backend, the engine is built without the `CUDA_MODULE_LOADING=LAZY` environment variable, but the variable is set during benchmarking to reduce device memory consumption.
Data format: fps / GPU memory usage (MB)
rife(model=44, 1920x1088)
| backend | 1 stream | 2 streams |
|---|---|---|
| ort-cuda | 53.62/1771 | 83.34/2748 |
| trt | 71.30/626 | 107.3/962 |
dpir color
| backend | 1 stream | 2 streams |
|---|---|---|
| ort-cuda | 4.64/3230 | |
| trt | 10.32/1992 | 11.61/3475 |
waifu2x upconv_7
| backend | 1 stream | 2 streams |
|---|---|---|
| ort-cuda | 11.07/5916 | 15.04/10899 |
| trt | 18.38/2092 | 31.64/3848 |
waifu2x cunet
| backend | 1 stream | 2 streams |
|---|---|---|
| ort-cuda | 4.63/8541 | 5.32/16148 |
| trt | 11.44/4771 | 15.59/8972 |
realesrgan v2/v3
| backend | 1 stream | 2 streams |
|---|---|---|
| ort-cuda | 8.84/2283 | 11.10/4202 |
| trt | 14.59/1324 | 21.37/2174 |
v11 RIFE support
Added support for the RIFE video frame interpolation algorithm.
There are two APIs for RIFE:
- `vsmlrt.RIFE` is a high-level API for interpolating a clip. Set the `multi` argument to specify the fps factor. Just remember to perform scene detection on the input clip (see the sketch below).
- `vsmlrt.RIFEMerge` is a novel temporal `std.MaskedMerge`-like interface for RIFE. Use it if you want to precisely control the frames and/or time point for the interpolation.
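A minimal usage sketch of the high-level API (assuming a VapourSynth environment with the `misc` plugin available for scene detection; the backend choice here is illustrative):

```python
import vapoursynth as vs
import vsmlrt

core = vs.core

# vs-mlrt models expect 32-bit float RGB input (RGBS).
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080, fpsnum=24)
# Mark scene changes first so RIFE does not interpolate across cuts.
clip = core.misc.SCDetect(clip, threshold=0.1)
# multi=2 doubles the frame rate; use any backend that supports RIFE.
doubled = vsmlrt.RIFE(clip, multi=2, backend=vsmlrt.Backend.ORT_CUDA(fp16=False))
```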
Known issues
- vstrt doesn't support RIFE for the moment[^1]. The next release of TensorRT should include RIFE support and we will release v12 when that happens.
- The vstrt backend also doesn't yet support the latest RTX 4000 series GPUs. This will be fixed after upgrading to the upcoming TensorRT 8.5 release. RTX 4000 series GPU owners, please use the other CUDA backends for now.
- Users of the `OV_GPU` backend may experience errors like `Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes`. Please consider tiling for now (see the sketch after this list). The reason is that the openvino library follows the OpenCL standard's restriction on memory object allocation (`CL_DEVICE_MAX_MEM_ALLOC_SIZE`). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB[^2].
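A sketch of the tiling workaround, using the `tiles`/`tilesize` arguments of the `vsmlrt.py` wrappers (the model and values are illustrative):

```python
import vsmlrt

# Tiling splits each frame into smaller pieces, so every OpenCL memory
# object stays below the ~4 GiB CL_DEVICE_MAX_MEM_ALLOC_SIZE limit.
flt = vsmlrt.DPIR(
    clip, strength=5,
    tiles=2,  # or tilesize=(1280, 720) to request explicit tile dimensions
    backend=vsmlrt.Backend.OV_GPU(fp16=True),
)
```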
[^1]: TensorRT is missing `grid_sample` operator support, see https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md.
[^2]: This value is derived from here, which states that a device not supporting `sharedSystemMemCapabilities` has a maximum allowed allocation size of 4294959104 bytes.
v11.test
Internal testing only.
Added support for the RIFE video frame interpolation algorithm. Some features are still being implemented. The Python RIFE model wrapper interface is still subject to change.
Known issue
- Users of the `OV_GPU` backend may experience errors like `Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes`. Please consider tiling for now. The reason is that the openvino library follows the OpenCL standard's restriction on memory object allocation (`CL_DEVICE_MAX_MEM_ALLOC_SIZE`). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB[^3].

[^3]: This value is derived from here, which states that a device not supporting `sharedSystemMemCapabilities` has a maximum allowed allocation size of 4294959104 bytes.
Model Release 20220923, RIFE model
New modules (compared to the previous model release):
- RIFE v4.0 from vs-rife v2.0.0: `rife/rife_v4.0.onnx`, config: `fastmode=True, ensemble=False`
- RIFE v4.2, v4.3, v4.4, v4.5, v4.6, v4.7, v4.8, v4.9, v4.10 from Practical-RIFE: `rife/rife_{v4.2,v4.3,v4.4,v4.5,v4.6,v4.7,v4.8,v4.9,v4.10}.onnx`, config: `fastmode=True, ensemble=False`
- Other provided RIFE models can be found here, including v2 representations of the RIFE v4.7-v4.10 models. Sorry for the inconvenience.
Notes:
- For RIFE on ort-cuda, vs-mlrt v11 or later is suggested for best performance. As of v11, only ov-cpu, ort-cpu, ort-cuda and trt (pending a new TensorRT release) support RIFE. Specifically, ncnn-vk does not support RIFE due to the missing `gridsample` op.
v10: new vulkan based vsncnn (AMD GPU supported)
Release Highlight
Vulkan based AMD GPU support added with the new vsncnn-vk backend.
Major features
- Introduced the ncnn-based vsncnn plugin that supports any GPU with Vulkan support (NVIDIA, AMD, Intel integrated & discrete).
- Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPUs to GPUs of all three major vendors.
- Please refer to the benchmark below for performance details. Tl;dr: it's comparable to vsort-cuda on most networks (except waifu2x-cunet), but (significantly) slower than vstrt. Owing to its C++ implementation, it's generally faster than Python-based ncnn implementations.
- Hint: if your GPU has enough memory, please consider setting `num_streams>1` to extract more performance (see the sketch after this list).
- Even though it's possible to use software-based Vulkan implementations (as we did in the GHA tests), if you want to do CPU-only inference, it's much better to use vsov-cpu (or vsort-cpu).
- Introduced a new, smaller Vulkan-based GPU binary package (`vsmlrt-windows-x64-vk.v10.7z`) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only use an Intel/AMD GPU or don't want to download 1GB of data in exchange for a backend that is merely 2~8x faster. Now there shouldn't be any reason not to use vs-mlrt.
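A sketch of the multi-stream hint above (the model and backend are illustrative; `num_streams` is the parameter named in the hint):

```python
import vsmlrt

# Two parallel inference streams trade extra GPU memory for throughput.
backend = vsmlrt.Backend.NCNN_VK(fp16=True, num_streams=2)
flt = vsmlrt.Waifu2x(
    clip, noise=-1, scale=2,
    model=vsmlrt.Waifu2xModel.upconv_7_anime_style_art_rgb,
    backend=backend,
)
```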
Benchmark
Configuration: NVIDIA RTX 3090, driver 516.94, Windows Server 2019, vs r60, python 3.10.7, 1080p fp16
Backends: ncnn-vk, ort-cuda, trt from vs-mlrt v10, dpir-ncnn v2.0.0, w2xncnnvk r2
Data format: fps / GPU memory usage (MB)
dpir color
| backend | 1 stream | 2 streams |
|---|---|---|
| ncnn-vk | 4.33/3347 | 4.72/6119 |
| ort-cuda | 4.56/3595 | |
| trt | 10.64/2595 | 11.10/4593 |
| dpir-ncnn | 3.68/3326 | |
waifu2x upconv_7
| backend | 1 stream | 2 streams |
|---|---|---|
| ncnn-vk | 9.46/6820 | 14.71/13468 |
| ort-cuda | 12.10/6411 | 13.98/11273 |
| trt | 21.32/3317 | 29.10/5053 |
| w2xncnnvk | 6.68/6931 | 12.70/13626 |
waifu2x cunet
| backend | 1 stream | 2 streams |
|---|---|---|
| ncnn-vk | 1.46/11908 | 1.53/23574 |
| ort-cuda | 4.85/8793 | 5.18/16231 |
| trt | 11.60/4960 | 15.60/9057 |
| w2xncnnvk | 1.38/11966 | 1.58/23687 |
realesrgan v2/v3
| backend | 1 stream | 2 streams |
|---|---|---|
| ncnn-vk | 7.23/2781 | 8.35/5330 |
| ort-cuda | 9.05/2669 | 10.18/4539 |
| trt | 15.93/1667 | 19.58/2543 |
v10.pre
This is a pre-release for testing & benchmarking purposes only.
For production use, please use the official v10 release.
Release Highlight
Vulkan based AMD GPU support added with the new vsncnn-vk backend.
Major features
- Introduced the ncnn-based vsncnn plugin that supports any GPU with Vulkan support (NVIDIA, AMD, Intel integrated & discrete). Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPUs to GPUs of all three major vendors.
- Introduced a new, smaller Vulkan-based GPU binary package (`vsmlrt-windows-x64-vk.v10.pre.7z`) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only use an Intel/AMD GPU or don't want to download 1GB of data in exchange for a backend that is merely 3x faster. Now there shouldn't be any reason not to use vs-mlrt.
v9.2
Fixed issues
- In vs-mlrt v9 and v9.1 on Windows, the `ORT_CUDA` backend may fail with an out-of-memory error when processing a non-initial frame. This has been fixed and performance should be improved.
- Parameter `use_cuda_graph` of the `ORT_CUDA` backend now works properly on Windows. However, its use is currently not recommended.
Full Changelog: v9.1...v9.2
v9.1
Bugfix release for v9. Recommended update for v9 users.
Please see the v9 release notes for all the major new features.
- Fix ort_cuda fp16 inference for the `CUGAN(version=2)` model. A new parameter `fp16_blacklist_ops` is introduced in the ort and ov backends for other issues possibly related to reduced precision. Please still carefully review the output of fp16-accelerated `CUGAN(version=2)`.
- Conform with `CUGAN(version=2)`'s dynamic range compression. This feature is enabled by setting `conformance=True` (which is the default) in the `CUGAN` wrapper in `vsmlrt.py`, and it's implemented as:

  ```python
  clip = clip.std.Expr("x 0.7 * 0.15 +")
  clip = CUGAN(clip, version=2)
  clip = clip.std.Expr("x 0.15 - 0.7 /")
  ```
Known issues
These two issues are fixed in the v9.2 release.
- The `ORT_CUDA` backend allocates memory during inference. This degrades performance and may result in out-of-memory errors.
- Parameter `use_cuda_graph` of the `ORT_CUDA` backend is broken on Windows.
Full Changelog: v9...v9.1
v9 Major release: Intel GPU support & much more
This is a major release.
- Added support for Intel GPUs (both discrete [Xe Arc series] and integrated [Gen 8+ on Broadwell+]).
  - In `vsmlrt.py`, this corresponds to the `OV_GPU` backend.
  - The openvino library is now dynamically linked because of the integration of oneDNN for GPU.
- Added support for `RealESRGANv3` and `cugan-pro` models.
- Upgraded the CUDA toolkit to 11.7.0, TensorRT to 8.4.1 and cuDNN to 8.4.1. It is now possible to build TRT engines for `CUGAN`, waifu2x `cunet` and `upresnet10` models on RTX 2000 and RTX 3000 series GPUs.
- The trt backend in the `vsmlrt.py` wrapper now creates a log file for `trtexec` output in the TEMP directory (this only works if using the bundled `trtexec.exe`). The log file will only be retained if `trtexec` fails (and the vsmlrt exception message will include the full path of the log file). If you want the log to go to a specific file, set the environment variable `TRTEXEC_LOG_FILE` to the absolute path of the log file. If you don't want this behavior, set `log=False` when creating the backend (e.g. `vsmlrt.Backend.TRT(log=False)`).
- The cuda bundles now include VC runtime DLLs as well, so `trtexec.exe` should run even on systems without proper VC runtime redistributable packages installed (e.g. freshly installed Windows).
- The ov backend can now configure model compilation via `config`. Available configurations can be found here.
  - Example: `core.ov.Model(..., config = lambda: dict(CPU_THROUGHPUT_STREAMS=core.num_threads, CPU_BIND_THREAD="NO"))`

    This configuration may be useful for improving processor utilization at the expense of significantly increased memory consumption (only try this if you have a huge number of cores underutilized by the default settings). The equivalent form for the Python wrapper is `backend = vsmlrt.Backend.OV_CPU(num_streams=core.num_threads, bind_thread=False)`.
- When using the `vsmlrt.py` wrapper, it will no longer create temporary onnx files (e.g. when using non-default `alpha` CUGAN parameters). Instead, the modified ONNX network is passed directly into the various ML runtime filters. Those filters now support `(network_path=b'raw onnx protobuf serialization', path_is_serialization=True)` for this (see the sketch below). This feature also opens the door to generating ONNX on the fly (e.g. ever dreamed of GPU-accelerated 2D convolution or `std.Expr`?)
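A sketch of the raw-serialization path (assuming the Python `onnx` package; the model file name and the choice of the ov filter are illustrative):

```python
import onnx

# Load (or build) an ONNX model and modify it in memory...
model = onnx.load("waifu2x.onnx")  # illustrative file name
# ...then pass the serialized protobuf directly to a runtime filter,
# with no temporary .onnx file written to disk.
flt = core.ov.Model(
    clip,
    network_path=model.SerializeToString(),
    path_is_serialization=True,
)
```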
Update Instructions
- Delete the previous `vsmlrt-cuda`, `vsov`, `vsort` and `vstrt` directories and `vsov.dll`, `vsort.dll` and `vstrt.dll` from your VS plugins directory, and then extract the newly released files. (Specifically, do not leave files from the previous version and just overwrite with the new release, as the new release might have removed some files in those four directories.)
- Replace `vsmlrt.py` in your Python package directory.
- Update the `models` directories by overwriting with the new release. (Models are generally append-only. We will make special notices and bump the model release tag if we change any of the previously released models.)
Compatibility Notes
`vsmlrt.py` in this release is not compatible with binaries from previous releases; only script-level compatibility is maintained. Generally, please make sure to upgrade the filters and `vsmlrt.py` as a whole.
We strive to maintain script source-level compatibility as much as possible (i.e. there won't be a great api4-style breakage), which means scripts written for v7 (for example) will continue to function for the foreseeable future. Minor issues (like the non-monotonic denoise setting of cugan) will be documented instead of fixed with a breaking change.
Known issue
`CUGAN(version=2)` (a.k.a. cugan-pro) may produce a blank clip when using the `ORT_CUDA(fp16)` backend. This is fixed in the v10 release.
Full Changelog: v8...v9
v8: latest CUDA libraries and ~10% faster
- This release upgrades the cuda libraries to their latest versions. Models are observed to be accelerated by ~1.1x.
- `vsmlrt.CUGAN()` now accepts a new parameter `alpha`, which controls the strength of filtering. Setting `alpha` to non-default values requires the Python `onnx` package (but this might change in the future). See the sketch below.
- Added a `tf32` parameter to the trt backend in vsmlrt.py. TF32 acceleration is enabled by default on Ampere GPUs, mostly for fp32 inference; it has no effect on other architectures.
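A sketch combining the two new knobs (the values are illustrative; non-default `alpha` needs the Python `onnx` package as noted above):

```python
import vsmlrt

# alpha scales the filtering strength of CUGAN; tf32 is an Ampere-only
# speed/precision trade-off that applies to fp32 inference.
flt = vsmlrt.CUGAN(
    clip, noise=0, scale=2, alpha=0.75,
    backend=vsmlrt.Backend.TRT(tf32=True),
)
```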