This is the official repository of our CVPR 2023 paper "Towards Scalable Neural Representation for Diverse Videos".
You can set up the conda environment by running:
```bash
conda create -n dnerv python=3.9.7
conda activate dnerv
conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia
pip install tensorboard
pip install tqdm dahuffman pytorch_msssim
```
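Before training, it is worth confirming that PyTorch sees the CUDA toolkit and the GPUs (a quick sanity check, nothing repo-specific):
```python
# Sanity check that the environment is ready for GPU training.
import torch
from pytorch_msssim import ms_ssim  # confirms the pip dependencies import

print(torch.__version__)
print(torch.version.cuda)           # expected "11.7" for this environment
print(torch.cuda.is_available())    # True once the GPUs and driver are visible
print(torch.cuda.device_count())    # the paper's setup uses 4 RTX-A6000 GPUs
```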
We adopt the existing deep image compression models provided by CompressAI. We provide the pre-extracted ground-truth video frames and pre-compressed keyframes for the UVG and UCF101 datasets in this google drive link.
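The keyframes in the download are already compressed, so nothing below is required to run our code; it only sketches how a keyframe could be compressed with a CompressAI pretrained model (the model choice and quality level here are illustrative assumptions, not necessarily the ones we used):
```python
# Illustrative use of a CompressAI pretrained image codec on a keyframe.
# cheng2020_anchor and quality=3 are assumptions made for this sketch.
import torch
from compressai.zoo import cheng2020_anchor

net = cheng2020_anchor(quality=3, pretrained=True).eval()

x = torch.rand(1, 3, 256, 320)  # dummy keyframe patch in [0, 1]
with torch.no_grad():
    out = net.compress(x)       # entropy-coded strings + latent shape
    x_hat = net.decompress(out["strings"], out["shape"])["x_hat"]
```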
Unzip it under the `data/` folder and make sure the data structure is as below.
```
data
├── UVG
│   ├── gt
│   ├── keyframe
│   └── annotation
└── UCF101
    ├── gt
    ├── keyframe
    └── annotation
```
Please note that we split the 1024x1920 UVG videos into non-overlapping 256x320 frame patches during training due to GPU memory limitations.
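For reference, this grid crop yields 4 x 6 = 24 patches per frame; a minimal sketch of the idea (the function name `split_into_patches` is illustrative, not the repo's actual code):
```python
import torch

def split_into_patches(frames, ph=256, pw=320):
    """Split frames of shape (T, C, 1024, 1920) into non-overlapping
    (ph, pw) patches -> (T * 24, C, ph, pw), row-major per frame."""
    t, c, h, w = frames.shape
    assert h % ph == 0 and w % pw == 0
    return (frames
            .unfold(2, ph, ph)            # (T, C, 4, W, ph)
            .unfold(3, pw, pw)            # (T, C, 4, 6, ph, pw)
            .permute(0, 2, 3, 1, 4, 5)    # (T, 4, 6, C, ph, pw)
            .reshape(-1, c, ph, pw))
```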
We train our model on 4 RTX-A6000 GPUs. To compare with other state-of-the-art video compression methods, we train for 1600 epochs on the UVG dataset and 800 epochs on the UCF101 dataset. You can use a smaller number of epochs to reduce training time.
```bash
# UVG dataset
python train.py --dataset UVG --model_type ${model_type} --model_size ${model_size} \
    -e 1600 -b 32 --lr 5e-4 --loss_type Fusion6 -d

# UCF101 dataset
python train.py --dataset UCF101 --model_type ${model_type} --model_size ${model_size} \
    -e 800 -b 32 --lr 5e-4 --loss_type Fusion19 -d
```
```bash
# Evaluate the model without model quantization
python train.py --dataset UVG --model_type D-NeRV --model_size M \
    --eval_only --model saved_model/UVG/D-NeRV_M.pth

# Evaluate the model with model quantization
python train.py --dataset UVG --model_type D-NeRV --model_size M \
    --eval_only --model saved_model/UVG/D-NeRV_M.pth --quant_model

# Evaluate with model quantization and dump the predicted frame patches
python train.py --dataset UVG --model_type D-NeRV --model_size M \
    --eval_only --model saved_model/UVG/D-NeRV_M.pth --quant_model \
    --dump_images
```
Please note that, for the UVG dataset, after splitting the 1024x1920 videos into 256x320 frame patches, the PSNR/MS-SSIM computed on patches differs from the actual PSNR/MS-SSIM of the full 1024x1920 frames. Therefore, we first dump the predicted frame patches and then re-evaluate PSNR/MS-SSIM against the ground-truth 1024x1920 video frames.
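Stitching the dumped patches back into full frames is the inverse of the grid crop above; a minimal sketch, assuming patches are stored row-major per frame (`stitch_patches` and `psnr` are illustrative helpers, not the repo's code):
```python
import torch
from pytorch_msssim import ms_ssim

def stitch_patches(patches, h=1024, w=1920, ph=256, pw=320):
    """Inverse of the grid crop: (T * 24, C, ph, pw) -> (T, C, h, w)."""
    nh, nw = h // ph, w // pw                # a 4 x 6 grid of patches
    c = patches.shape[1]
    return (patches
            .reshape(-1, nh, nw, c, ph, pw)
            .permute(0, 3, 1, 4, 2, 5)       # (T, C, nh, ph, nw, pw)
            .reshape(-1, c, h, w))

def psnr(pred, gt):
    """PSNR for frames scaled to [0, 1]."""
    return -10 * torch.log10(torch.mean((pred - gt) ** 2))

# pred, gt: (T, C, 1024, 1920) tensors in [0, 1]
# quality: psnr(pred, gt) and ms_ssim(pred, gt, data_range=1.0)
```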
Results for different model configurations are shown in the following tables. The PSNR/MS-SSIM results are reported for the quantized models.
UVG:

Model | Arch | Model Param (M) | Entropy Encoding | Keyframe Size (Mb) | Total (Mb) | BPP | PSNR | MS-SSIM | Link
---|---|---|---|---|---|---|---|---|---
D-NeRV | XS | 8.02 | 0.883 | 88.39 | 145.0 | 0.0189 | 34.11 | 0.9479 | link
D-NeRV | S | 15.96 | 0.881 | 88.39 | 200.9 | 0.0262 | 34.76 | 0.9540 | link
D-NeRV | M | 24.20 | 0.880 | 123.2 | 293.6 | 0.0383 | 35.74 | 0.9604 | link
D-NeRV | L | 41.66 | 0.877 | 175.1 | 467.3 | 0.0609 | 36.78 | 0.9668 | link
D-NeRV | XL | 69.75 | 0.875 | 254.7 | 730.3 | 0.0952 | 37.43 | 0.9719 | link
UCF101:

Model | Arch | Model Param (M) | Entropy Encoding | Keyframe Size (Mb) | Total (Mb) | BPP | PSNR | MS-SSIM | Link
---|---|---|---|---|---|---|---|---|---
D-NeRV | S | 21.40 | 0.882 | 481.6 | 632.7 | 0.0559 | 28.11 | 0.9153 | link
D-NeRV | M | 38.90 | 0.891 | 481.6 | 758.7 | 0.0671 | 29.15 | 0.9364 | link
D-NeRV | L | 61.30 | 0.891 | 481.6 | 918.3 | 0.0812 | 29.97 | 0.9501 | link
NeRV | S | 88.00 | 0.903 | - | 635.9 | 0.0562 | 26.78 | 0.9094 | link
NeRV | M | 105.3 | 0.900 | - | 758.4 | 0.0671 | 27.06 | 0.9177 | link
NeRV | L | 127.2 | 0.903 | - | 919.1 | 0.0813 | 27.61 | 0.9284 | link
For the UVG dataset, H = 1024, W = 1920, and the number of frames is 3900.
For the UCF101 dataset (training split), H = 256, W = 320, and the number of frames is 138041.
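These dimensions determine the BPP column: BPP = total size in bits / (H x W x number of frames), where the Mb columns are in megabits. A small illustrative check against the tables (not part of the repo):
```python
# Reproduce the BPP column: total size in bits / (H * W * num_frames).
def bpp(total_mb, h, w, num_frames):
    return total_mb * 1e6 / (h * w * num_frames)

print(f"{bpp(145.0, 1024, 1920, 3900):.4f}")   # D-NeRV-XS on UVG    -> 0.0189
print(f"{bpp(632.7, 256, 320, 138041):.4f}")   # D-NeRV-S  on UCF101 -> 0.0559
```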
If you find our code or our paper useful for your research, please star this repo and cite the following paper:
```bibtex
@inproceedings{he2023dnerv,
  title     = {Towards Scalable Neural Representation for Diverse Videos},
  author    = {He, Bo and Yang, Xitong and Wang, Hanyu and Wu, Zuxuan and Chen, Hao and Huang, Shuaiyi and Ren, Yixuan and Lim, Ser-Nam and Shrivastava, Abhinav},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023},
}
```