AutoVFX: Physically Realistic Video Editing from Natural Language Instructions.
Hao-Yu Hsu1, Chih-Hao Lin1, Albert J. Zhai1, Hongchi Xia1, Shenlong Wang1
1University of Illinois at Urbana-Champaign
International Conference on 3D Vision (3DV), 2025
- Environment Setup
- Pretrained checkpoints, data, and software preparation
- Simulation example on Garden scene
- Details of pose extraction (SfM) and pose alignment
- Details of training 3DGS
- Details of surface reconstruction
- Details of estimating relative scene scale
- Code for sampling custom camera trajectory
- Local gradio app demo
The code has been tested on:
- OS: Ubuntu 22.04.5 LTS
- GPU: NVIDIA GeForce RTX 4090
- Driver Version: 550
- CUDA Version: 12.4
- nvcc: 11.8
- Create environment:
git clone https://github.com/haoyuhsu/autovfx.git
cd autovfx/
conda create -n autovfx python=3.10
conda activate autovfx
- Install PyTorch & cudatoolkit:
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
# (Optional) To build the necessary CUDA extensions, cuda-toolkit is required.
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
- Install Gaussian Splatting submodules:
cd sugar/gaussian_splatting/
pip install submodules/diff-gaussian-rasterization
pip install submodules/simple-knn
- Install segmentation & tracking modules:
# Tracking-with-DEVA
cd ../../tracking
pip install -e .
# Grounded-SAM
git clone https://github.com/hkchengrex/Grounded-Segment-Anything.git
cd Grounded-Segment-Anything
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
python -m pip install -e segment_anything
python -m pip install -e GroundingDINO
# RAM & Tag2Text
git submodule init
git submodule update
git clone https://github.com/xinyu1205/recognize-anything.git
pip install -r ./recognize-anything/requirements.txt
pip install -e ./recognize-anything/
- Install inpainting modules:
# LaMa
cd ../../inpaint/lama
pip install -r requirements.txt
- Install lighting estimation modules:
# DiffusionLight
cd ../../lighting/diffusionlight
pip install -r requirements.txt
- Install other required packages:
# Other packages
pip install openai objaverse kornia wandb open3d plyfile imageio-ffmpeg einops e3nn pygltflib lpips scann geffnet open_clip_torch sentence-transformers==2.7.0 mmcv vedo
# PyTorch3D (try one of the commands)
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
conda install pytorch3d -c pytorch3d
# Trimesh with speedup packages
pip install trimesh==4.3.2
pip install Rtree==1.2.0
conda install -c conda-forge embree=2.17.7
conda install -c conda-forge pyembree
# (Optional) COLMAP, if not built from source
conda install conda-forge::colmap
cd ../..
We use DEVA for open-vocabulary video segmentation.
cd tracking
bash download_models.sh
We use LaMa to inpaint the unseen region.
cd inpaint && mkdir ckpts
wget https://huggingface.co/smartywu/big-lama/resolve/main/big-lama.zip && unzip big-lama.zip -d ckpts
rm big-lama.zip
We use CLIP & SBERT features to annotate assets in Objaverse, and we use SBERT features to annotate PBR materials in PolyHaven. The preprocessed embeddings of both Objaverse 3D assets and PolyHaven PBR materials need to be downloaded.
cd retrieval
# download processed embeddings
gdown --folder https://drive.google.com/drive/folders/1Lw87MstzbQgEX0iacTm9GpLYK2UE3gNm
# download PolyHaven PBR-materials
gdown https://drive.google.com/uc?id=1adZo_FPyLj7pFofNJfxSbnAv_EaJEV75
unzip polyhaven.zip && rm polyhaven.zip
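Retrieval over precomputed embeddings like these usually reduces to a nearest-neighbor search. The sketch below is a minimal NumPy illustration, assuming the embeddings form a matrix with one row per asset and that cosine similarity is the ranking metric. The function name and the toy vectors are hypothetical, and the repository's own retrieval code (which may use an approximate index) may differ.

```python
import numpy as np


def top_k_by_cosine(query, embeddings, k=3):
    """Return indices and scores of the k rows of `embeddings` closest to `query`."""
    query = np.asarray(query, dtype=np.float64)
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy 3-D vectors standing in for real SBERT/CLIP features (hypothetical data).
asset_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
indices, scores = top_k_by_cosine([1.0, 0.1, 0.0], asset_embeddings, k=2)
```

In practice the query vector would come from encoding the text prompt with the same model that produced the stored embeddings.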
We tested with Blender 3.6.11. Note that Blender 3+ requires Ubuntu version >= 20.04.
cd third_parties/Blender
wget https://download.blender.org/release/Blender3.6/blender-3.6.11-linux-x64.tar.xz
tar -xvf blender-3.6.11-linux-x64.tar.xz
rm blender-3.6.11-linux-x64.tar.xz
Please download the preprocessed dataset of the Garden scene from here for a quick demo. The expected folder structure of the dataset is:
datasets
└── <your scene name>
    ├── custom_camera_path      # optional for free-viewpoint rendering
    │   ├── transforms_001.json
    │   └── ...
    ├── images
    │   ├── 00000.png
    │   ├── 00001.png
    │   ├── 00002.png
    │   └── ...
    ├── mesh
    │   ├── material_0.png
    │   ├── mesh.mtl
    │   └── mesh.obj
    ├── emitter_mesh.obj        # optional for indoor scenes
    ├── normal
    │   ├── 00000_normal.png
    │   ├── 00001_normal.png
    │   ├── 00002_normal.png
    │   └── ...
    ├── sparse
    │   └── 0
    │       ├── cameras.bin
    │       ├── images.bin
    │       └── points3D.bin
    └── transforms.json
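Before running the pipeline, it can help to check that a scene folder matches this layout. The following is a small sketch using only the standard library. The helper name is hypothetical, only the non-optional entries are checked, and transforms.json is assumed to sit at the scene root.

```python
from pathlib import Path

# Required entries, relative to the scene folder (optional ones are skipped).
# Assumption: transforms.json lives directly under the scene folder.
REQUIRED_ENTRIES = [
    "images",
    "mesh/mesh.obj",
    "normal",
    "sparse/0/cameras.bin",
    "sparse/0/images.bin",
    "sparse/0/points3D.bin",
    "transforms.json",
]


def missing_entries(scene_dir):
    """Return the required relative paths that do not exist under `scene_dir`."""
    root = Path(scene_dir)
    return [rel for rel in REQUIRED_ENTRIES if not (root / rel).exists()]
```

For example, `missing_entries("datasets/garden_large")` would return an empty list when the downloaded scene is complete.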
For your custom dataset, please follow these steps:
- Create a folder and put your images under images. The folder will look like this:
datasets
└── <your scene name>
    └── images
        ├── 00000.png
        ├── 00001.png
        ├── 00002.png
        └── ...
- Estimate normal maps, which are used for both pose alignment and normal regularization during 3DGS and BakedSDF training. We currently support three methods for monocular normal estimation: Metric3D, DSINE, and Omnidata. Empirically, the quality of normal estimation ranks as Metric3D > DSINE > Omnidata.
python dataset_utils/get_mono_normal.py \
--dataset_dir ./datasets/<your scene name> \
--method metric3d # 'metric3d', 'dsine', 'omnidata'
- Perform pose extraction using COLMAP, followed by pose alignment to set the up direction of the scene to (0,0,1). Specify a text prompt for the most obvious flat surface in the scene, such as ground, floor, or table.
python dataset_utils/colmap_runner.py \
--dataset_dir ./datasets/<your scene name> \
--text_prompt ground
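Geometrically, aligning the up direction amounts to finding a rotation that maps the normal of the prompted flat surface to (0,0,1). The sketch below shows one way to build such a rotation with Rodrigues' formula. It assumes the surface normal has already been estimated, the function name is hypothetical, and it is not necessarily how colmap_runner.py performs the alignment.

```python
import numpy as np


def rotation_to_z_up(normal):
    """Return a 3x3 rotation R such that R @ normal points along (0, 0, 1)."""
    n = np.asarray(normal, dtype=np.float64)
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(n, z)            # rotation axis, length equals sin(angle)
    s = np.linalg.norm(axis)
    c = float(np.dot(n, z))          # cos(angle)
    if s < 1e-8:
        # Already aligned, or exactly opposite (then rotate 180 degrees about x).
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues' rotation formula.
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)
```

To apply the result to a camera-to-world matrix, left-multiply its top three rows by R so that both the camera orientation and position are rotated into the aligned frame.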
- For details on surface mesh extraction, please refer to the Estimate Scene Properties section.
- All camera poses are camera-to-world transforms in the OpenCV format (x: right, y: down, z: front). Please refer to this tutorial on conversion between the OpenCV and OpenGL camera formats.
- We support sampling custom camera poses along a circular trajectory. Adjust the sampling parameters in dataset_utils/sample_custom_traj.py and run:
python dataset_utils/sample_custom_traj.py \
--dataset_dir ./datasets/<your scene name> \
--traj_name <your trajectory name> \
--vis_traj
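To make the camera conventions above concrete, here is a minimal sketch that samples camera-to-world poses on a horizontal circle in the OpenCV format (x: right, y: down, z: front) with world up (0,0,1), and converts them to the OpenGL format. All function and parameter names are hypothetical, and the actual parameters in sample_custom_traj.py may differ.

```python
import numpy as np


def look_at_opencv(eye, target, up=(0.0, 0.0, 1.0)):
    """Camera-to-world matrix in OpenCV convention (x right, y down, z forward)."""
    eye = np.asarray(eye, dtype=np.float64)
    forward = np.asarray(target, dtype=np.float64) - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=np.float64))
    right = right / np.linalg.norm(right)   # undefined if looking straight up or down
    down = np.cross(forward, right)
    c2w = np.eye(4)
    c2w[:3, 0] = right
    c2w[:3, 1] = down
    c2w[:3, 2] = forward
    c2w[:3, 3] = eye
    return c2w


def sample_circular_trajectory(center, radius, height, num_poses):
    """Sample poses on a horizontal circle above `center`, all looking at it."""
    center = np.asarray(center, dtype=np.float64)
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, num_poses, endpoint=False):
        offset = np.array([radius * np.cos(theta), radius * np.sin(theta), height])
        poses.append(look_at_opencv(center + offset, center))
    return np.stack(poses)


def opencv_to_opengl(c2w):
    """Flip the camera y and z axes: OpenCV (y down, z front) to OpenGL (y up, z back)."""
    return c2w @ np.diag([1.0, -1.0, -1.0, 1.0])


poses = sample_circular_trajectory(center=(0.0, 0.0, 0.0), radius=1.0, height=0.5, num_poses=8)
poses_gl = opencv_to_opengl(poses)
```

The conversion only negates the second and third columns of the rotation, so camera positions are unchanged and applying it twice returns the original poses.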
We use BakedSDF as implemented in SDFStudio for surface reconstruction. Please make sure to use our custom SDFStudio for reproducibility. We recommend creating a separate environment for this part, since our custom SDFStudio has been tested with CUDA 11.3.
# Example command
ns-train bakedsdf-mlp --vis wandb \
--output-dir outputs/<scene name> --experiment-name <experiment name> \
--trainer.steps-per-save 1000 \
--trainer.steps-per-eval-image 5000 --trainer.steps-per-eval-all-images 50000 \
--trainer.max-num-iterations 250001 --trainer.steps-per-eval-batch 5000 \
--pipeline.datamanager.train-num-rays-per-batch 2048 \
--pipeline.datamanager.eval-num-rays-per-batch 512 \
--pipeline.model.sdf-field.inside-outside False \
--pipeline.model.background-model none \
--pipeline.model.near-plane 0.001 --pipeline.model.far-plane 6.0 \
--machine.num-gpus 1 \
--pipeline.model.mono-normal-loss-mult 0.1 \
panoptic-data \
--data <path to your dataset> \
--panoptic_data False --mono_normal_data True --panoptic_segment False \
--orientation-method none --center-poses False --auto-scale-poses False
Generally, a decent surface mesh can be obtained with the command above. However, there are several hyperparameters that you should be careful to set appropriately.
- For fully captured indoor scenes, such as those in ScanNet++, set --pipeline.model.sdf-field.inside-outside to True.
- For outdoor scenes with distant backgrounds, such as those in Tanks & Temples, set --pipeline.model.background-model to mlp.
- Adjust --pipeline.datamanager.train-num-rays-per-batch, --pipeline.datamanager.eval-num-rays-per-batch, and --pipeline.model.num-neus-samples-per-ray if you encounter OOM (out-of-memory) errors during training.
scene=outputs/<scene name>/<experiment name>/bakedsdf-mlp/<timestamp>
# Extract mesh
python scripts/extract_mesh.py --load-config $scene/config.yml \
--output-path $scene/mesh.ply \
--bounding-box-min -2.0 -2.0 -2.0 --bounding-box-max 2.0 2.0 2.0 \
--resolution 2048 --marching_cube_threshold 0.001 --create_visibility_mask True --simplify-mesh True
mkdir $scene/textured
# Bake texture
python scripts/texture.py \
--load-config $scene/config.yml \
--input-mesh-filename $scene/mesh-simplify.ply \
--output-dir $scene/textured \
--target_num_faces None
It is best to leave bounding-box-min and bounding-box-max unchanged, since camera poses are already normalized to lie within a unit cube during the pose alignment step.
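For intuition about why a fixed extraction bounding box works, the sketch below shows one common way to normalize camera positions into a cube. It assumes the target cube is [-1, 1]^3 and that normalization is a recentering followed by a uniform scale. The function name is hypothetical, and the repository's pose alignment step may use a different convention.

```python
import numpy as np


def normalize_poses_to_unit_cube(c2w):
    """Recenter and uniformly rescale camera positions to fit inside [-1, 1]^3."""
    c2w = np.array(c2w, dtype=np.float64)    # copy; expected shape (N, 4, 4)
    positions = c2w[:, :3, 3]
    lo, hi = positions.min(axis=0), positions.max(axis=0)
    center = 0.5 * (lo + hi)
    half_extent = 0.5 * (hi - lo).max()      # largest half side of the bounding box
    scale = 1.0 / half_extent                # assumes at least two distinct positions
    c2w[:, :3, 3] = (positions - center) * scale
    return c2w, center, scale
```

The same center and scale would also have to be applied to the sparse point cloud and any existing mesh so that everything stays in one coordinate frame.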
You can start training 3D Gaussian Splatting with a single command.
bash train_3dgs.sh <your scene name>
Explanation of several hyperparameters used in train_3dgs.sh:
- Optimization parameters:
  - lambda_normal: loss between the rendered normal and the monocular normal prediction
  - lambda_pseudo_normal: loss between the rendered normal and the pseudo normal derived from rendered depth
  - lambda_anisotropic: regularizes the shape of the 3D Gaussians to be isotropic
- Densification parameters:
  - Consider adjusting size_threshold and min_opacity if the Gaussians are floating excessively.
- Gaussian initialization parameters:
  - --init_strategy:
    - colmap: use a point cloud extracted from COLMAP for initialization
    - ray_mesh: use intersection points between camera rays from all training views and the scene mesh for initialization
    - hybrid: combine both colmap and ray_mesh for initialization
  - Ensure that --scene_sdf_mesh_path is specified when using ray_mesh or hybrid.
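To illustrate what the two normal terms compare, the sketch below derives pseudo normals from a depth map by finite differences and measures agreement between two normal maps with a cosine loss. This is a NumPy illustration under stated assumptions (pinhole intrinsics fx, fy, cx, cy; normals expressed in the camera frame and facing the camera; a 1 - cos penalty). The function names are hypothetical, and the exact loss used in the training code may differ.

```python
import numpy as np


def pseudo_normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate per-pixel normals (camera frame) from a depth map by finite differences."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in the camera frame.
    points = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1)
    dx = points[:-1, 1:] - points[:-1, :-1]    # step along image x
    dy = points[1:, :-1] - points[:-1, :-1]    # step along image y
    normals = np.cross(dy, dx)                 # order chosen so normals face the camera (-z)
    norms = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.clip(norms, 1e-12, None)   # shape (h-1, w-1, 3)


def normal_cosine_loss(pred, target):
    """Mean of 1 - cos(angle) between two unit-normal maps of equal shape."""
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))
```

Under this reading, a weight such as lambda_pseudo_normal would scale a term like `normal_cosine_loss(rendered_normals, pseudo_normals)` before it is added to the total loss.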
Use the following script to estimate the relative scale between the current scene and the real world. Then, set the --scene_scale parameter to the estimated value during simulation.
python dataset_utils/estimate_scene_scale.py \
--dataset_dir ./datasets/<your scene name> \
--scene_mesh_path ./datasets/<your scene name>/mesh/mesh.obj \
--anchor_frame_idx 0
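As a sanity check on the estimated value, the scale can be read as real-world meters per scene unit: an object that is 1 unit tall in the scene but 0.7 meters tall in reality gives a scale of 0.7. The helpers below are a hypothetical sketch of that arithmetic only and do not reproduce how estimate_scene_scale.py picks or measures a reference object.

```python
def estimate_scene_scale(real_size_m, scene_size_units):
    """Real-world meters represented by one unit of the reconstructed scene."""
    if scene_size_units <= 0:
        raise ValueError("scene_size_units must be positive")
    return real_size_m / scene_size_units


def meters_to_scene_units(size_m, scene_scale):
    """Convert a real-world length to scene units for a given scene scale."""
    return size_m / scene_scale


# An object that is 1 unit tall in the scene but 0.7 m tall in reality.
scale = estimate_scene_scale(real_size_m=0.7, scene_size_units=1.0)
```

The inverse conversion is what matters when inserting assets: an object of known real size must be shrunk or enlarged to the matching number of scene units.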
Please download the preprocessed Garden scene from here, and the pretrained 3DGS checkpoints and estimated scene properties from here.
# If you encounter an error with gdown, please use the Google Drive link above to download the files.
mkdir datasets && cd datasets
gdown --folder https://drive.google.com/drive/folders/1eRdSAqDloGXk04JK60v3io6GHWdomy2N
cd ../
mkdir output && cd output
gdown --folder https://drive.google.com/drive/folders/1KE8LSA_r-3f2LVlTLJ5k4SHENvbwdAfN
- Text Prompt: "Drop 5 basketballs on the table."
export OPENAI_API_KEY=/your/openai_api_key/
export MESHY_API_KEY=/your/meshy_api_key/ # if you want to retrieve generated 3D assets
SCENE_NAME=garden_large
CUSTOM_TRAJ_NAME=transforms_001
SCENE_SCALE=2.67
BLENDER_CONFIG_NAME=blender_cfg_rigid_body_simulation
python edit_scene.py \
--source_path datasets/${SCENE_NAME} \
--model_path output/${SCENE_NAME}/ \
--gaussians_ckpt_path output/${SCENE_NAME}/coarse/sugarcoarse_3Dgs15000_densityestim02_sdfnorm02/22000.pt \
--custom_traj_name ${CUSTOM_TRAJ_NAME} \
--anchor_frame_idx 0 \
--scene_scale ${SCENE_SCALE} \
--edit_text "Drop 5 basketballs on the table." \
--scene_mesh_path datasets/${SCENE_NAME}/mesh/mesh.obj \
--blender_config_name ${BLENDER_CONFIG_NAME}.json \
--blender_output_dir_name ${BLENDER_CONFIG_NAME} \
--render_type MULTI_VIEW \
--deva_dino_threshold 0.45 \
--is_uv_mesh
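When running several edits on the same scene, it can be convenient to assemble this command programmatically. The sketch below builds the same argument list in Python using only the flags and paths shown in the command above. The helper name and its defaults are hypothetical, and the checkpoint path is copied from the Garden example, so it would need adjusting for other scenes.

```python
import shlex


def build_edit_command(scene_name, edit_text, scene_scale,
                       traj_name="transforms_001",
                       blender_cfg="blender_cfg_rigid_body_simulation"):
    """Assemble the argument list for edit_scene.py from the flags shown above."""
    # Checkpoint location copied from the Garden example; other scenes may differ.
    ckpt = (f"output/{scene_name}/coarse/"
            "sugarcoarse_3Dgs15000_densityestim02_sdfnorm02/22000.pt")
    return [
        "python", "edit_scene.py",
        "--source_path", f"datasets/{scene_name}",
        "--model_path", f"output/{scene_name}/",
        "--gaussians_ckpt_path", ckpt,
        "--custom_traj_name", traj_name,
        "--anchor_frame_idx", "0",
        "--scene_scale", str(scene_scale),
        "--edit_text", edit_text,
        "--scene_mesh_path", f"datasets/{scene_name}/mesh/mesh.obj",
        "--blender_config_name", f"{blender_cfg}.json",
        "--blender_output_dir_name", blender_cfg,
        "--render_type", "MULTI_VIEW",
        "--deva_dino_threshold", "0.45",
        "--is_uv_mesh",
    ]


cmd = build_edit_command("garden_large", "Drop 5 basketballs on the table.", 2.67)
print(shlex.join(cmd))   # shell-quoted command line, handy for logging
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would launch the edit, provided the API keys are exported in the environment first.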
All parameters are listed in opt.py.
Arguments used in opt.py (flag names are given where they appear in the command above; see opt.py for the rest):
- --source_path: Path to the dataset directory.
- --model_path: Path to the output directory.
- --gaussians_ckpt_path: Path to the Gaussian model checkpoint (.pt for SuGaR, .ply for vanilla 3DGS).
- --scene_mesh_path: Path to the reconstructed scene mesh (.ply or .obj).
- Path to the emitter mesh for indoor lighting (.obj); only used for indoor scenes.
- --edit_text: Editing instructions.
- --custom_traj_name: Filename of the custom trajectory (default: training cameras).
- --anchor_frame_idx: Index of the frame used for single-view simulation (default: 0).
- --scene_scale: Relative scale of the scene. If an object in the scene is 1 unit tall but is known to be 0.7 meters in the real world, the scene_scale is 0.7. This parameter is crucial for ensuring accurate size correspondence and realistic simulation or rendering.
- --blender_output_dir_name, --blender_config_name: Names of the Blender output folder and the Blender .json config.
- --render_type: Choose 'MULTI_VIEW' to render frames along the entire camera trajectory, or 'SINGLE_VIEW' for static rendering from a single camera position (i.e., anchor_frame_idx).
- Number of frames to simulate and render (only used when '--render_type=SINGLE_VIEW').
- --is_uv_mesh: Enable this option if the scene mesh has UV textures.
- Enable this option if the scene is an indoor scene.
- Enable this option to simulate on Waymo road scenes. A different prompt for GPT-4 is used to fulfill road scene simulations, similar to ChatSim.
- --deva_dino_threshold: Increase this threshold to reduce excessive object detection (0.7 is optimal, but lower it to 0.45 for hard-to-detect cases).
If you find this paper and repository useful for your research, please consider citing:
@article{hsu2024autovfx,
title={AutoVFX: Physically Realistic Video Editing from Natural Language Instructions},
author={Hsu, Hao-Yu and Lin, Zhi-Hao and Zhai, Albert and Xia, Hongchi and Wang, Shenlong},
journal={arXiv preprint arXiv:2411.02394},
year={2024}
}
This project is supported by the Intel AI SRS gift, a Meta research grant, the IBM IIDAI Grant, and NSF Awards #2331878, #2340254, #2312102, #2414227, and #2404385. Hao-Yu Hsu is supported by a Siebel Scholarship. We greatly appreciate the NCSA for providing computing resources. We thank Derek Hoiem, Sarita Adve, Benjamin Ummenhofer, Kai Yuan, Micheal Paulitsch, Katelyn Gao, and Quentin Leboutet for helpful discussions.
Our codebase is built on gaussian-splatting, SuGaR, SDFStudio, DiffusionLight, DEVA, Objaverse, and, most importantly, Blender. Thanks to the authors for open-sourcing their work!