# VioletV2

Technical report coming soon!

Spoilers: the model is similar to the one in the original paper, but it replaces the cumbersome detection network with a CLIP vision encoder (which can be trained end to end without relying on an external model) and uses adapters on the decoder side.
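As a rough sketch of that design (a minimal illustration under stated assumptions, not the repo's actual code: `CLIPVisionModel` is the Hugging Face class, while the adapter dimensions and the decoder interface are hypothetical):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP inserted into a decoder layer."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual bottleneck

class VioletSketch(nn.Module):
    """CLIP patch features feed a cross-attending text decoder, end to end."""
    def __init__(self, decoder):
        super().__init__()
        # Trainable vision encoder replaces the frozen detection network.
        self.encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.decoder = decoder  # any decoder that cross-attends to vision features

    def forward(self, pixel_values, caption_tokens):
        vision_feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        return self.decoder(caption_tokens, encoder_hidden_states=vision_feats)
```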

## Data

- COCO images HDF5 file: Download (see the inspection sketch below)
- Annotations: Download
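A quick way to inspect the downloaded HDF5 file (the dataset names inside the file are not documented here, so list the keys first to see the actual layout):

```python
import h5py

# Open read-only and list the top-level datasets/groups.
with h5py.File("coco_images.h5", "r") as f:
    print(list(f.keys()))  # actual key names depend on how the file was built
    # Hypothetical follow-up: inspect the first dataset's shape and dtype
    # first_key = list(f.keys())[0]
    # print(f[first_key].shape, f[first_key].dtype)
```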

## Environment setup

Clone the repository and create the Violet conda environment:

```bash
conda env create -f violet.yml
```
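Then activate it (assuming the environment defined in `violet.yml` is named `violet`):

```bash
conda activate violet
```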

Create the `logs` and `saved_models` directories:

```bash
mkdir logs
mkdir saved_models
```

## Checkpoint

Early checkpoint: Download
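Once downloaded, the weights should be loadable with `torch.load`; the filename and state-dict layout below are hypothetical, so inspect the file first:

```python
import torch

# Hypothetical filename; use whatever the download link provides.
ckpt = torch.load("violet_early_checkpoint.pth", map_location="cpu")
print(type(ckpt))  # often a plain state dict, or a dict with a "model" entry
# model.load_state_dict(ckpt)            # if it is a plain state dict
# model.load_state_dict(ckpt["model"])   # if weights sit under a "model" key
```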

## Train the model (refactored code)

A simpler, more user-friendly implementation (you can ignore the `data` and `evaluation` folders when using it):

```bash
python train_refactored.py --batch_size 60 --head 12 --tau 0.3 --images_path coco_images.h5 --annotation_folder annotations --lr 1e-4 --random_seed 42 --log_file logs/log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 1 --exp_name violet
```
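For reference, `--gradient_accumulation_steps` typically follows the standard accumulation pattern; below is a generic, self-contained illustration of that pattern, not the repo's exact training loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real script builds the Violet model and a COCO loader.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(6)]
accum_steps = 2  # corresponds to --gradient_accumulation_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # average over micro-batches
    loss.backward()  # gradients accumulate across backward calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()
```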

## Train the model (legacy code)

Based on the code used in the Meshed-Memory Transformer and VisualGPT, updated to run on Python 3 instead of the original Python 2.7:

```bash
python train_legacy.py --batch_size 40 --head 12 --tau 0.3 --features_path ./coco_images.h5 --annotation_folder annotations --lr 1e-4 --random_seed 42 --log_file logs/log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 1 --exp_name violet
```

## Acknowledgement

This code uses resources from the Meshed-Memory Transformer, Transformers, and VisualGPT repositories.