Han-Hung Lee*1, Yiming Zhang*1 and Angel Xuan Chang1,2
* Equal Contribution 1 Simon Fraser University 2 Canada-CIFAR AI Chair, Amii
We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization than existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Compared to the current SOTA point cloud method, which requires 480 A100 hours to train 1 billion model parameters, we require only 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility in use cases than point clouds, including the ability to encode objects with a variable number of images, with better performance as more views are used. This is in contrast to point cloud based methods, where an entire scan or model of the object is required. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance on more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
This is the official initial release for the paper Duoduo CLIP: Efficient 3D Understanding with Multi-View Images. This release provides evaluation on the LVIS split of Objaverse as well as object retrieval from text. We will release the full data preparation and training code soon; see the TODOs for items we will add to the repo. The pretrained models and model cards will be provided in this repo, and the data here.
We use miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n ddclip python=3.10
conda activate ddclip
# install PyTorch
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# install Python libraries
pip install -r requirements.txt
cd open_clip_mod
pip install .
# install Faiss
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
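After installation, a quick sanity check can save debugging later. This is a minimal sketch that only verifies the main dependencies import and that CUDA is visible; it is not part of the repo's scripts.

# check_env.py - verify that PyTorch and Faiss were installed correctly
import torch
import torchvision
import faiss

print(f"PyTorch {torch.__version__}, torchvision {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Faiss {faiss.__version__}, GPUs visible to Faiss: {faiss.get_num_gpus()}")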
import numpy as np
from PIL import Image
from src.model.wrapper import get_model
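# examples/couch.jpg stores 12 rendered views stacked along the image height;
# the reshape below recovers them as (12, 224, 224, 3)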
mv_images = Image.open('examples/couch.jpg')
mv_images = np.asarray(mv_images).reshape(12, 224, 224, 3)
duoduoclip = get_model('Four_1to6F_bs1600_LT6.ckpt', device='cuda')
text_features = duoduoclip.encode_text('a 3D model of a white couch')
# The model can take multi-view images of shape (F, H, W, 3)
# (F is number of multi-views, H and W are image resolutions)
image_features = duoduoclip.encode_image(mv_images)
similarity = text_features.squeeze() @ image_features.squeeze()
print(similarity)
# The model can also take single view images of shape (H, W, 3)
# (H and W are image resolutions)
image_features = duoduoclip.encode_image(mv_images[0])
similarity = text_features.squeeze() @ image_features.squeeze()
print(similarity)
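Building on the snippet above, similarities against several candidate captions can be turned into zero-shot predictions. This is a minimal sketch rather than part of the released API: the extra captions are made up, and the explicit normalization is only needed if the encoders do not already return unit-norm features.

import torch
import torch.nn.functional as F

captions = [
    'a 3D model of a white couch',
    'a 3D model of a wooden chair',
    'a 3D model of a red sports car',
]
text_features = torch.stack([duoduoclip.encode_text(c).squeeze() for c in captions])
image_features = duoduoclip.encode_image(mv_images).squeeze()

# Cosine similarity between each caption and the multi-view shape embedding,
# converted to per-caption probabilities with a softmax
text_features = F.normalize(text_features, dim=-1)
image_features = F.normalize(image_features, dim=-1)
probs = (text_features @ image_features).softmax(dim=0)
for caption, prob in zip(captions, probs.tolist()):
    print(f'{prob:.3f}  {caption}')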
- Download the Objaverse LVIS files (~80GB) for evaluation; they will be placed in the dataset/data folder.
python preprocess/download_lvis.py
We also provide embeddings for every object in the entire Objaverse dataset, computed from 12 randomly rendered views per object.
- Download the shape embeddings (~800MB). This includes the shape embeddings produced by the default model, which are placed under dataset/data/objaverse_embeddings/Four_1to6F_bs1600_LT6.
python preprocess/download_embeddings.py
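These precomputed embeddings make it possible to do large-scale text-to-shape retrieval without re-encoding any renderings. Below is a minimal sketch using Faiss (installed above); the file names embeddings.npy and object_ids.npy are hypothetical placeholders for whatever layout download_embeddings.py actually produces, so adjust them accordingly.

import numpy as np
import faiss
from src.model.wrapper import get_model

# Hypothetical layout: an (N, D) float32 array of shape embeddings plus a parallel array of object ids
emb_dir = 'dataset/data/objaverse_embeddings/Four_1to6F_bs1600_LT6'
shape_embeddings = np.load(f'{emb_dir}/embeddings.npy').astype('float32')
object_ids = np.load(f'{emb_dir}/object_ids.npy', allow_pickle=True)

# Normalize so that inner-product search equals cosine similarity
faiss.normalize_L2(shape_embeddings)
index = faiss.IndexFlatIP(shape_embeddings.shape[1])
index.add(shape_embeddings)

# Encode a text query and retrieve the 5 closest shapes
duoduoclip = get_model('Four_1to6F_bs1600_LT6.ckpt', device='cuda')
query = duoduoclip.encode_text('a 3D model of a white couch')
query = query.squeeze().detach().float().cpu().numpy()[None, :]
faiss.normalize_L2(query)
scores, indices = index.search(query, 5)
for score, idx in zip(scores[0], indices[0]):
    print(f'{score:.3f}  {object_ids[idx]}')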
Note that this takes a large amount of disk space, as there are about 800k rendered objects, each with 12 views at 224x224 resolution.
- Download the multi-view Objaverse renderings from Zero123 (~1.5TB) and uncompress the archive.
wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz
- Download the supplementary multi-view images (~160GB) not included in Zero123, as well as the image and text embeddings used in training.
python preprocess/download_supplement.py
- Merge the multi-view images into an h5 file. The resulting h5 file is ~1.5TB.
python preprocess/combine_four_h5.py --zero123_path /path/to/uncompressed/views_release
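To spot-check the merged file, the renderings can be read back with h5py. This is only a sketch: the output path and the per-object dataset layout (12 views of 224x224x3 uint8, keyed by Objaverse object id) are assumptions based on the description above, not the script's documented interface.

import h5py

# Hypothetical output path of combine_four_h5.py
with h5py.File('dataset/data/four_mvimages.h5', 'r') as f:
    object_id = next(iter(f.keys()))   # assumed: one dataset per object id
    mv_images = f[object_id][:]        # expected shape (12, 224, 224, 3), dtype uint8
    print(object_id, mv_images.shape, mv_images.dtype)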
- Run the Objaverse LVIS evaluation over multiple view settings; a sketch of the underlying accuracy metric follows the commands below. The model here was trained with 1 to 6 frames sampled per object and with the last 6 layers trainable.
python test_objaverse_lvis.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
- Retrieve Objaverse models using text as input. You can visualize the retrieved models here.
python text_retrieval.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
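Both scripts ultimately rank shape embeddings against text embeddings by cosine similarity. For reference, here is a minimal sketch of a top-k accuracy computation for this kind of zero-shot classification; the tensors are random placeholders and this is not the repo's evaluation code.

import torch
import torch.nn.functional as F

def topk_accuracy(shape_features, text_features, labels, k=1):
    """shape_features: (N, D) object embeddings; text_features: (C, D) category text embeddings;
    labels: (N,) ground-truth category indices."""
    shape_features = F.normalize(shape_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = shape_features @ text_features.t()      # (N, C) cosine similarities
    topk = logits.topk(k, dim=-1).indices            # (N, k) best-matching categories
    correct = (topk == labels.unsqueeze(-1)).any(dim=-1)
    return correct.float().mean().item()

# Toy example: 8 objects classified against 5 categories
acc = topk_accuracy(torch.randn(8, 512), torch.randn(5, 512), torch.randint(0, 5, (8,)), k=1)
print(acc)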
- Train the model with the command below. (Note: this requires 4 GPUs, each with at least 24GB of memory.)
python train.py experiment_name=Four_1to6F_bs1600_LT6 trainer.devices=4
- Important Flags
# Number of GPUs
trainer.devices=4
# Training batch size
data.train.dataloader.batch_size=400
# Number of multi-views to sample during training
"data.train.metadata.num_views=[1, 6]"
# Layer threshold for training (here, all layers after and including the 6th layer will be trained; set to 0 to train all layers)
model.network.layers_threshold=6
- Add data preparation code for Four, MVImgNet and Text2Shape.
- Add training code for all settings in the paper.
- Add evaluation scripts for MVPNet and Text2Shape.
OpenCLIP: Our model backbones and weights are based on the open-source OpenCLIP implementation. The folder open_clip_mod contains the same code as OpenCLIP, with some minor modifications to expose additional functions from the package. The code within src/custom_clip modifies the OpenCLIP models to support the multi-view attention described in the paper.
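For readers who want to locate that change: the key idea of multi-view (cross-view) attention is to let patch tokens attend across all views of the same object instead of within each view independently. The following is a minimal conceptual sketch of that idea, not the actual code in src/custom_clip.

import torch

def cross_view_attention(tokens, attn, num_views):
    """tokens: (B * num_views, N, D) patch tokens, with consecutive batch entries
    belonging to views of the same object; attn: nn.MultiheadAttention(batch_first=True)."""
    BV, N, D = tokens.shape
    B = BV // num_views
    # Merge the view and token axes so attention spans all views of one object
    x = tokens.reshape(B, num_views * N, D)
    x, _ = attn(x, x, x)
    # Restore the original per-view layout
    return x.reshape(BV, N, D)

# Toy example: 2 objects x 6 views, 197 tokens of width 768
attn = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
tokens = torch.randn(2 * 6, 197, 768)
print(cross_view_attention(tokens, attn, num_views=6).shape)  # torch.Size([12, 197, 768])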
OpenShape: Our training framework closely follows that of OpenShape. We also use the model ids and text captions from their released dataset for training.
Zero123: A large portion of our rendered object images comes from Zero123; we also use their rendering script to render images for the remaining objects.
We thank the authors for their work and for releasing their code and weights!
This work was funded by a CIFAR AI Chair, a NSERC Discovery grant, and a CFI/BCKDF JELF grant.