HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video (CVPR 2022)

Project Page | Paper | Video

This is an official implementation. The codebase is implemented using PyTorch and tested on Ubuntu 20.04.4 LTS.

Prerequisite

Configure environment

Install Miniconda (recommended) or Anaconda.

Create and activate a virtual environment.

conda create --name humannerf python=3.7
conda activate humannerf

Install the required packages.

pip install -r requirements.txt

Download SMPL model

Download the gender-neutral SMPL model from here and unpack mpips_smplify_public_v2.zip.

Copy the SMPL model into third_parties/smpl/models.

SMPL_DIR=/path/to/smpl
MODEL_DIR=$SMPL_DIR/smplify_public/code/models
cp $MODEL_DIR/basicModel_neutral_lbs_10_207_0_v1.0.0.pkl third_parties/smpl/models

Follow this page to remove Chumpy objects from the SMPL model.

Run on ZJU-Mocap Dataset

Below, we use subject 387 as a running example.

Prepare a dataset

First, download the ZJU-Mocap dataset from here.

Second, edit the YAML file for subject 387 at tools/prepare_zju_mocap/387.yaml. In particular, set zju_mocap_path to the directory containing the ZJU-Mocap dataset.

dataset:
    zju_mocap_path: /path/to/zju_mocap
    subject: '387'
    sex: 'neutral'

...

Finally, run the data preprocessing script.

cd tools/prepare_zju_mocap
python prepare_dataset.py --cfg 387.yaml
cd ../../

Train/Download models

Now you can either download a pre-trained model by running the script.

./scripts/download_model.sh 387

or train a model yourself. We used 4 GPUs (NVIDIA RTX 2080 Ti) for training.

python train.py --cfg configs/human_nerf/zju_mocap/387/adventure.yaml

As a sanity check, we provide a configuration that supports training on a single GPU (NVIDIA RTX 2080 Ti). Note that performance is not guaranteed for this configuration.

python train.py --cfg configs/human_nerf/zju_mocap/387/single_gpu.yaml

Render output

Render the input frames (i.e., the observed motion sequence).

python run.py \
    --type movement \
    --cfg configs/human_nerf/zju_mocap/387/adventure.yaml 

Run free-viewpoint rendering on a particular frame (e.g., frame 128).

python run.py \
    --type freeview \
    --cfg configs/human_nerf/zju_mocap/387/adventure.yaml \
    freeview.frame_idx 128
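
The trailing `freeview.frame_idx 128` in the command above reads like a YACS-style `KEY VALUE` override of a nested config entry. The following is a minimal sketch of that idea in plain Python, not the repository's actual parsing code; the function names `parse_value` and `apply_overrides` and the default value `0` are hypothetical and used only for illustration.

```python
import ast


def parse_value(text):
    """Turn a command-line string into a Python literal when possible."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # leave plain strings (e.g. paths) untouched


def apply_overrides(cfg, opts):
    """Apply ['section.key', 'value', ...] pairs to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for dotted_key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(dotted_key)  # refuse to create new keys silently
        node[leaf] = parse_value(raw)
    return cfg


# Hypothetical default config; only the key name comes from the command above.
cfg = {"freeview": {"frame_idx": 0}}
apply_overrides(cfg, ["freeview.frame_idx", "128"])
print(cfg)
```

Rejecting unknown keys means a mistyped override fails immediately rather than being silently ignored.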

Render the learned canonical appearance (T-pose).

python run.py \
    --type tpose \
    --cfg configs/human_nerf/zju_mocap/387/adventure.yaml 

In addition, you can find the rendering scripts in scripts/zju_mocap.

Run on a Custom Monocular Video

To get the best result, we recommend a video clip that meets these requirements:

  • The clip has fewer than 600 frames (~20 seconds).
  • The human subject shows most body regions (e.g., front and back views of the body) in the clip.

Prepare a dataset

To train on a monocular video, prepare your video data in dataset/wild/monocular with the following structure:

monocular
    ├── images
    │   └── ${item_id}.png
    ├── masks
    │   └── ${item_id}.png
    └── metadata.json

We use item_id to match a video frame with its subject mask and metadata. An item_id is typically some alphanumeric string such as 000128.
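
Because every frame must have a matching mask and metadata entry, it can help to check the folder before running the preprocessing script. The following is a minimal sketch, assuming the layout shown above; the function name `check_monocular_layout` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def check_monocular_layout(root):
    """Return a list of human-readable problems found under `root`.

    Assumes images/<item_id>.png, masks/<item_id>.png, and a
    metadata.json whose top-level keys are item_ids.
    """
    root = Path(root)
    image_ids = {p.stem for p in (root / "images").glob("*.png")}
    mask_ids = {p.stem for p in (root / "masks").glob("*.png")}
    with open(root / "metadata.json") as f:
        meta_ids = set(json.load(f).keys())

    problems = []
    for item_id in sorted(image_ids - mask_ids):
        problems.append(f"{item_id}: image has no mask")
    for item_id in sorted(image_ids - meta_ids):
        problems.append(f"{item_id}: image has no metadata entry")
    for item_id in sorted((mask_ids | meta_ids) - image_ids):
        problems.append(f"{item_id}: mask or metadata entry has no image")
    # The recommendation above is a clip with fewer than 600 frames.
    if len(image_ids) >= 600:
        problems.append(f"{len(image_ids)} frames; fewer than 600 is recommended")
    return problems
```

Calling `check_monocular_layout("dataset/wild/monocular")` returns an empty list when every item_id appears in all three places.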

images

A collection of video frames, stored as PNG files.

masks

A collection of subject segmentation masks, stored as PNG files.

metadata.json

This JSON file contains metadata for the video frames, including:

  • human body pose (SMPL pose and shape (beta) coefficients)
  • camera pose (camera intrinsic and extrinsic matrices). We follow the OpenCV camera coordinate system and use a pinhole camera model.
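
To make the camera convention concrete, the sketch below projects a 3D point to pixel coordinates with a pinhole model. It assumes that cam_extrinsics is a 4x4 world-to-camera transform and that the camera frame follows OpenCV (x right, y down, z forward); the function name `project_point` is hypothetical, and the matrices are the sample values from the metadata.json example in this section.

```python
def project_point(point_world, cam_intrinsics, cam_extrinsics):
    """Project a 3D world point to pixel coordinates (u, v).

    Assumption: cam_extrinsics maps world coordinates to camera
    coordinates, and skew in cam_intrinsics is zero.
    """
    x, y, z = point_world
    # Top three rows of the extrinsic matrix: X_cam = R @ X_world + t.
    xc, yc, zc = (
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_extrinsics[:3]
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    fx, cx = cam_intrinsics[0][0], cam_intrinsics[0][2]
    fy, cy = cam_intrinsics[1][1], cam_intrinsics[1][2]
    # Perspective divide by depth, then shift by the principal point.
    return fx * xc / zc + cx, fy * yc / zc + cy


K = [
    [23043.9, 0.0, 940.19],
    [0.0, 23043.9, 539.23],
    [0.0, 0.0, 1.0],
]
E = [
    [1.0, 0.0, 0.0, -0.005],
    [0.0, 1.0, 0.0, 0.2218],
    [0.0, 0.0, 1.0, 47.504],
    [0.0, 0.0, 0.0, 1.0],
]
u, v = project_point((0.0, 0.0, 0.0), K, E)
print(f"world origin projects to pixel ({u:.2f}, {v:.2f})")
```

If the projected body joints land far outside the image, the extrinsics are likely stored in the opposite (camera-to-world) direction or in a different axis convention.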

You can run SMPL-based human pose detectors (e.g., SPIN, VIBE, or ROMP) on a monocular video to get body poses as well as camera poses.

{
  // Replace the string item_id with the file name of your video frame.
  "item_id": {
        // A (72,) array: SMPL coefficients controlling body pose.
        "poses": [
            -3.1341, ..., 1.2532
        ],
        // A (10,) array: SMPL coefficients controlling body shape. 
        "betas": [
            0.33019, ..., 1.0386
        ],
        // A 3x3 camera intrinsic matrix.
        "cam_intrinsics": [
            [23043.9, 0.0, 940.19],
            [0.0, 23043.9, 539.23],
            [0.0, 0.0, 1.0]
        ],
        // A 4x4 camera extrinsic matrix.
        "cam_extrinsics": [
            [1.0, 0.0, 0.0, -0.005],
            [0.0, 1.0, 0.0, 0.2218],
            [0.0, 0.0, 1.0, 47.504],
            [0.0, 0.0, 0.0, 1.0]
        ]
  },

  ...

  // Repeat for every video frame.
  "item_id": {
      ...
  }
}
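
The `//` comments and `...` in the sample above are for illustration only; Python's `json.load` rejects comments, so the real file must be strict JSON. The sketch below checks that each frame entry has the four fields with the shapes listed above. The names `EXPECTED_SHAPES`, `shape_of`, and `validate_metadata` are hypothetical helpers, not part of this repository.

```python
EXPECTED_SHAPES = {
    "poses": (72,),
    "betas": (10,),
    "cam_intrinsics": (3, 3),
    "cam_extrinsics": (4, 4),
}


def shape_of(value):
    """Return the shape of a 1-D or rectangular 2-D nested list, else None."""
    if not isinstance(value, list):
        return None
    if value and all(isinstance(row, list) for row in value):
        widths = {len(row) for row in value}
        return (len(value), widths.pop()) if len(widths) == 1 else None
    return (len(value),)


def validate_metadata(metadata):
    """Return a list of problems; an empty list means all entries look well-formed."""
    problems = []
    for item_id, frame in metadata.items():
        for key, expected in EXPECTED_SHAPES.items():
            if key not in frame:
                problems.append(f"{item_id}: missing '{key}'")
            elif shape_of(frame[key]) != expected:
                problems.append(
                    f"{item_id}: '{key}' has shape {shape_of(frame[key])}, expected {expected}"
                )
    return problems


# Placeholder values only; real entries come from a pose estimator.
example = {
    "000128": {
        "poses": [0.0] * 72,
        "betas": [0.0] * 10,
        "cam_intrinsics": [[0.0] * 3 for _ in range(3)],
        "cam_extrinsics": [[0.0] * 4 for _ in range(4)],
    }
}
print(validate_metadata(example))
```

To check a real file, pass the dictionary returned by `json.load` on dataset/wild/monocular/metadata.json to `validate_metadata`.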

Once the dataset is properly created, run the script to complete dataset preparation.

cd tools/prepare_wild
python prepare_dataset.py --cfg wild.yaml
cd ../../

Train a model

Now we are ready to launch training. By default, we used 4 GPUs (NVIDIA RTX 2080 Ti) to train a model.

python train.py --cfg configs/human_nerf/wild/monocular/adventure.yaml

As a sanity check, we provide a single-GPU (NVIDIA RTX 2080 Ti) training config. Note that performance is not guaranteed for this configuration.

python train.py --cfg configs/human_nerf/wild/monocular/single_gpu.yaml

Render output

Render the input frames (i.e., the observed motion sequence).

python run.py \
    --type movement \
    --cfg configs/human_nerf/wild/monocular/adventure.yaml 

Run free-viewpoint rendering on a particular frame (e.g., frame 128).

python run.py \
    --type freeview \
    --cfg configs/human_nerf/wild/monocular/adventure.yaml \
    freeview.frame_idx 128

Render the learned canonical appearance (T-pose).

python run.py \
    --type tpose \
    --cfg configs/human_nerf/wild/monocular/adventure.yaml 

In addition, you can find the rendering scripts in scripts/wild.

Acknowledgement

The implementation draws on NeRF-PyTorch, Neural Body, Neural Volume, LPIPS, and YACS. We thank the authors for generously releasing their code.

Citation

If you find our work useful, please consider citing:

@InProceedings{weng_humannerf_2022_cvpr,
    title     = {Human{N}e{RF}: Free-Viewpoint Rendering of Moving People From Monocular Video},
    author    = {Weng, Chung-Yi and 
                 Curless, Brian and 
                 Srinivasan, Pratul P. and 
                 Barron, Jonathan T. and 
                 Kemelmacher-Shlizerman, Ira},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {16210-16220}
}