Spoilers: The model is similar to the one in the original paper, but it replaces the cumbersome object-detection network with a CLIP vision encoder (which can be trained end-to-end without relying on an external model) and adds adapters on the decoder side
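A minimal sketch of that idea, assuming a Hugging Face CLIP vision backbone and illustrative class names (not the repository's actual modules): the CLIP encoder produces patch features that serve as cross-attention memory for a caption decoder whose layers carry small bottleneck adapters.

```python
# Illustrative sketch only -- class names and sizes are assumptions, not the repo's code.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel  # Hugging Face transformers


class Adapter(nn.Module):
    """Small residual bottleneck adapter inserted into each decoder layer."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual bottleneck: only these small weights need to be trained.
        return x + self.up(self.act(self.down(x)))


class ClipEncoder(nn.Module):
    """Wraps a CLIP vision tower and returns per-patch features for the decoder."""

    def __init__(self, name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained(name)

    def forward(self, pixel_values):
        # (batch, num_patches + 1, hidden) embeddings, usable as cross-attention memory.
        return self.clip(pixel_values=pixel_values).last_hidden_state


if __name__ == "__main__":
    encoder = ClipEncoder()
    dummy = torch.randn(2, 3, 224, 224)   # two RGB images at CLIP's input resolution
    memory = encoder(dummy)
    print(memory.shape)                   # e.g. torch.Size([2, 50, 768])
```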
COCO images HDF5 file: Download
Annotations: Download
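The internal layout of the HDF5 file isn't described here, so the sketch below only opens it and lists the top-level entries; the filename coco_images.h5 is taken from the training commands further down.

```python
# Quick, layout-agnostic inspection of the COCO images HDF5 file.
import h5py

with h5py.File("coco_images.h5", "r") as f:    # filename taken from the training commands
    for key in list(f.keys())[:10]:            # show the first few top-level entries
        item = f[key]
        shape = getattr(item, "shape", None)   # datasets have shapes, groups do not
        print(key, shape)
```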
Clone the repository and create the Violet conda environment
conda env create -f violet.yml
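Then activate it (assuming the environment defined in violet.yml is named violet):
conda activate violet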
Make the logs and saved_models directories
mkdir logs
mkdir saved_models
Early checkpoint: Download
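The checkpoint format isn't documented here; below is a hedged sketch for inspecting a typical PyTorch checkpoint, with the filename and expected keys as assumptions.

```python
# Inspect the downloaded checkpoint -- filename and key names are assumptions.
import torch

ckpt = torch.load("saved_models/violet_early.pth", map_location="cpu")  # hypothetical path
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # often 'state_dict', 'optimizer', 'epoch', etc.
```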
A simpler and friendlier implementation (you can ignore the data and evaluation folders when using this)
python train_refactored.py --batch_size 60 --head 12 --tau 0.3 --images_path coco_images.h5 --annotation_folder annotations --lr 1e-4 --random_seed 42 --log_file logs/log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 1 --exp_name violet
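The --gradient_accumulation_steps flag trades GPU memory for effective batch size. The loop below is a generic sketch of how such a flag is usually applied, not the repository's actual training loop.

```python
# Generic gradient-accumulation sketch -- not the repo's training loop.
import torch


def train_one_epoch(model, loader, optimizer, accumulation_steps: int = 1):
    model.train()
    optimizer.zero_grad()
    for step, (images, captions) in enumerate(loader):
        loss = model(images, captions)             # assumes the model returns a scalar loss
        (loss / accumulation_steps).backward()     # scale so accumulated gradients match one large batch
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                       # update once every `accumulation_steps` mini-batches
            optimizer.zero_grad()
```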
Based on the code used in the Meshed-Memory Transformer and VisualGPT, updated to use Python 3 instead of the original Python 2.7
python train_legacy.py --batch_size 40 --head 12 --tau 0.3 --features_path ./coco_images.h5 --annotation_folder annotations --lr 1e-4 --random_seed 42 --log_file logs/log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 1 --exp_name violet
This code uses resources from the Meshed-Memory Transformer, Transformers, and VisualGPT.