
I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting (WACV 2025)

This repository contains the reference code for the paper "I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting".

Open In Colab

Installation

Prerequisites: CUDA>=12.1, Conda for environment management, Git LFS for downloading Stable Diffusion 2 Inpainting.

Use the following command to install the Conda environment:

conda env create -f env.yaml

Activate the environment with:

conda activate i-dream-my-painting

Next, install the English language model for spaCy:

python -m spacy download en_core_web_sm
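
To verify the installation and preview the kind of parsing used later for accuracy evaluation (step 3 of Make the Dataset extracts noun chunk roots), here is a minimal sketch; the example sentence is arbitrary:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("a man riding a white horse")  # arbitrary example text
for chunk in doc.noun_chunks:
    print(chunk.text, "->", chunk.root.text)
# a man -> man
# a white horse -> horse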

Prepare Dataset

We provide the outputs of the Kosmos-2 model for global image annotations and of the LLaVA model for object-level annotations, hoping to facilitate research at the intersection of computer vision and art. Enter the data directory and unzip the annotation files:

unzip annotations.zip

This will create the annotations.json file with the global image annotations and the llava.json file with the object-level annotations. The object-level annotations can be used only after the Make the Dataset step.
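
As a quick sanity check, both files can be inspected with the standard json module (a minimal sketch; it only assumes the two files are plain JSON, keyed by image):

import json

with open("data/annotations.json") as f:
    global_annotations = json.load(f)
with open("data/llava.json") as f:
    object_annotations = json.load(f)

print(len(global_annotations), "entries with global annotations")
print(len(object_annotations), "entries with object-level annotations")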

Download the Images

To download the images at our thumbnail sizes using the WikiArt API, you can use the following command:

python -m inpainting.data.downloader download-and-save-images-wikiart-v2 -o data/mm_inp_dataset/images

The tqdm progress bar may stall because of multi-processing, and some downloads may fail. If errors occur, repeat the command above to fetch the missing images. All images have been downloaded once the command:

ls data/mm_inp_dataset/images | wc -l

returns 116475.
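
Equivalently, a small Python check (a sketch that simply counts the files in the images directory):

from pathlib import Path

n = sum(1 for p in Path("data/mm_inp_dataset/images").iterdir() if p.is_file())
missing = 116475 - n
print(f"{n} images downloaded" + (f", {missing} missing: re-run the downloader" if missing else ""))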

Make the Dataset

The code for data annotation and preprocessing is available in the inpainting.data package. Below are the commands to run the annotation and preprocessing scripts using the Kosmos-2 outputs in the data/annotations.json file. If you have downloaded the images as in the Download the Images step and have the provided files data/annotations.json and data/llava.json, you can skip the most time-consuming step 4 (LLaVA captioning) and move directly to step 5.

  1. Make masks from annotations (~10min):

    python -m inpainting.data.ops make-masks-dataset --annotations-path data/annotations.json --image-dir data/mm_inp_dataset/images --out-dir data/mm_inp_dataset/masks

    This will save the masks as PNG images in the data/mm_inp_dataset/masks directory. The directory will contain one subdirectory per image, each containing the masks for that image, named mask_0.png, mask_1.png, etc.

  2. Make entities dataset (~10min):

    python -m inpainting.data.ops make-entities-dataset --annotations-path data/annotations.json --image-dir data/mm_inp_dataset/images --out-dir data/mm_inp_dataset/entities

    This will save crops of the images corresponding to the bounding boxes detected by Kosmos-2. For entities associated with multiple bounding boxes, the crops are combined into a single image. The directory will contain one subdirectory per image, each containing the crops for that image, named mask_0.png, mask_1.png, etc. Additionally, each subdirectory will contain an annotations.json file: a dictionary whose keys are the masks (mask_0, mask_1, etc.) and whose values are dictionaries with the key concept, holding the noun chunk detected by Kosmos-2.

  3. Extract noun chunk roots using spaCy (~2min):

    python -m inpainting.data.ops extract-noun-chunk-roots --entities-dir data/mm_inp_dataset/entities

    This will update the per-image annotations.json files in the data/mm_inp_dataset/entities directory with the key noun_chunk_root, whose value is the root of the noun chunk detected by Kosmos-2 (used for accuracy evaluation).

  4. Caption masks (hours) - skip if you have data/llava.json:

    python -m inpainting.data.ops caption-masks --entities-dir data/mm_inp_dataset/entities --model-id llava-hf/llava-v1.6-vicuna-13b-hf --batch-size 8 --max-new-tokens 40 --num-processes 1 --process-id 0

    This will use LLaVA-1.6-Vicuna-13B to caption the masks and create the object-level annotations. You can run this script multiple times with different --process-id values to parallelize the process (setting --num-processes to the number of parallel processes); a sketch for merging the per-process outputs is shown after this list. Every 100 steps, the script saves the LLaVA outputs to a backup JSON file named backup_{process_id}.json in the directory where the script is run; its keys are image names and its values are dictionaries mapping mask_0, mask_1, etc. to the corresponding captions, plus an additional key steps that tracks the number of steps taken. Resuming from a backup file is not implemented in the script, but it can be done manually by adjusting the code. At the end, the final annotations are saved to backup_{process_id}_final.json in the directory where the script is run.

  5. Move LLaVA annotations to the entities directory (~5sec):

    If you are using the provided data/llava.json file (see the Prepare Dataset section), you can move the LLaVA annotations to the entities directory with the following command:

    python -m inpainting.data.ops llava-annotations-to-folder --annotations-path data/llava.json --out-dir data/mm_inp_dataset/entities

    Otherwise, if you have executed the previous step (LLaVA captioning), move the annotations with the following command:

    python -m inpainting.data.ops llava-annotations-to-folder --annotations-path backup_{process_id}_final.json --out-dir data/mm_inp_dataset/entities

    This will move the LLaVA annotations from the given annotations file to the entities directory, creating in each image subdirectory a new file llava.json whose keys are the masks and whose values are the captions.

  6. Clean and save LLaVA annotations (~10sec):

    python -m inpainting.data.ops llava-annotations-to-entity-annotations --entities-dir data/mm_inp_dataset/entities

    This will clean the LLaVA annotations (removing the prefix, lowercasing, and stripping stray characters) and save them in the annotations.json file of each image subdirectory. For each mask, a new key caption is added to the dictionary, holding the cleaned caption.

  7. Split the dataset (not needed if the data/mm_inp_dataset/images directory already contains the train, val, and test subdirectories):

    python -m inpainting.data.ops train-val-test-split --image-dir data/mm_inp_dataset/images --masks-dir data/mm_inp_dataset/masks --split-df-path data/wikiart_images.csv

    This will split the dataset into training, validation, and test sets, saving the corresponding image files in subdirectories with names train, val, and test in the data/mm_inp_dataset/images directory. Images without masks are saved in the unannotated subdirectory.
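
As mentioned in step 4, running the captioning with several parallel processes produces one backup_{process_id}_final.json per process, while step 5 expects a single annotations file. A minimal merging sketch (a hypothetical helper, not part of the repository; it assumes the backup structure described in step 4):

import json
from pathlib import Path

merged = {}
for path in sorted(Path(".").glob("backup_*_final.json")):
    data = json.loads(path.read_text())
    data.pop("steps", None)  # drop the step counter, if present
    merged.update(data)      # keys: image names; values: {"mask_0": caption, ...}

Path("llava_merged.json").write_text(json.dumps(merged))
print(f"Merged annotations for {len(merged)} images into llava_merged.json")

The resulting llava_merged.json should then be usable as the --annotations-path in step 5.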

At the end of this process, each dataset example consists of an image associated with multiple masks (in the masks directory), where each mask has an object crop together with its LLaVA-generated object-level description (in the entities directory).
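
A short sketch of how a single example can be read back from disk (it only assumes the directory layout and JSON keys described in the steps above):

import json
from pathlib import Path

root = Path("data/mm_inp_dataset")
example = next(d for d in (root / "entities").iterdir() if d.is_dir())  # pick any annotated image

with open(example / "annotations.json") as f:
    entities = json.load(f)

for mask_path in sorted((root / "masks" / example.name).glob("mask_*.png")):
    info = entities[mask_path.stem]  # "mask_0", "mask_1", ...
    print(mask_path.name, "|", info["concept"], "|", info.get("noun_chunk_root"), "|", info.get("caption"))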

Setup Multi-Mask Inpainting with RCA

To use our multi-mask inpainting model, you need to clone the stabilityai/stable-diffusion-2-inpainting model using Git LFS from its repository on the Hugging Face Hub. In the repository, click on the icon with three vertical dots, select "Clone repository," and follow the instructions to clone the repo into the models directory (you should have a models/stable-diffusion-2-inpainting directory).
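
Equivalently, from the command line (a sketch; git lfs install only needs to be run once after installing Git LFS):

git lfs install
git clone https://huggingface.co/stabilityai/stable-diffusion-2-inpainting models/stable-diffusion-2-inpainting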

Next, replace stable-diffusion-2-inpainting/model_index.json and stable-diffusion-2-inpainting/unet/config.json with the files from the models/sd_replacements directory of this repository.
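
For example (a sketch; the exact layout inside models/sd_replacements may differ, so adjust the source paths accordingly):

cp models/sd_replacements/model_index.json models/stable-diffusion-2-inpainting/model_index.json
cp models/sd_replacements/unet/config.json models/stable-diffusion-2-inpainting/unet/config.json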

Download Model Weights

Download the model weights from this URL.

  • Prompt generation (LLaVA-MultiMask): unzip multimask.zip and move the extracted folder to the models/llava directory. The folder also includes the optimizer states and other data needed to resume training.
  • Multi-mask inpainting (SD-2-Inp-RCA-FineTuned): unzip rca.zip and move the extracted folder to the models/sd directory.

Try the Model!

If you have followed the steps in the Prepare Dataset section, downloaded the model weights as indicated in the Download Model Weights section, and set up multi-mask inpainting with RCA as indicated in the Setup Multi-Mask Inpainting with RCA section, you can now try the model!

Go to the notebooks directory and run the try_pipe.ipynb notebook.

Experiments

Prerequisites

Models

To train the models, you have to download the base models from the 🤗 Hugging Face Hub. Please download the following models:

Logging

To log the training process, you need to have a Weights & Biases account and install the wandb package:

pip install wandb

If you are running this code on an HPC cluster, you may need to log information offline:

wandb offline
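
Runs recorded offline can later be uploaded by pointing wandb sync at the offline run directories created under wandb/, for example:

wandb sync wandb/offline-run-*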

Prompt Generation

We present prompt generation results:

Model                    | Accuracy (%) | BLEU@1 | BLEU@4 | ROUGE-L | CLIPSim
LLaVA-Prompt             | 7.74         | 20.81  | 1.30   | 19.99   | 22.46
LLaVA-1Mask              | 36.52        | 36.99  | 12.58  | 34.64   | 24.65
LLaVA-MultiMask-1Pred    | 35.48        | 37.68  | 13.15  | 34.98   | 24.79
LLaVA-MultiMask-LastPred | 33.08        | 37.40  | 12.61  | 34.45   | 24.46
LLaVA-MultiMask-All      | 31.73        | 37.33  | 12.43  | 34.33   | 24.24

And we provide the commands to reproduce them:

  1. LLaVA-Prompt

    • Test
      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/test/config_base.yaml
  2. LLaVA-1Mask

    • Train

      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/train/llava_1mask.yaml
    • Test

      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/test/llava_1mask.yaml
  3. LLaVA-MultiMask

    • Train

      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/train/llava_multimask.yaml
    • Test (LLaVA-MultiMask-1Pred)

      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/test/llava_multimask_1pred.yaml
    • Test (LLaVA-MultiMask-LastPred and LLaVA-MultiMask-All)

      accelerate launch -m inpainting.models.image_to_text.fine_tune_llavanext train --config-path=models/configs/image_to_text/test/llava_multimask.yaml

We trained the models for one epoch on four NVIDIA A100 64GB GPUs; a single training run requires ~9.5 hours. Tests are performed on a single NVIDIA A100 64GB GPU and require ~6 hours.

Once a training or test run is completed, the results are logged to Weights & Biases.

Multi-Mask Inpainting

We present multi-mask inpainting results:

Model                      | FID ↓         | LPIPS ↓       | PSNR ↑        | CLIP-IQA ↑    | CLIPSim-I2I ↑ | CLIPSim-T2I ↑
SD-2-Inp-HQPrompt          | 19.18 (31.86) | 22.82 (24.29) | 14.55 (14.36) | 71.51 (73.89) | 84.87 (85.63) | 21.10 (20.92)
SD-2-Inp                   | 15.07 (27.40) | 21.90 (23.63) | 14.66 (14.35) | 73.10 (75.74) | 88.87 (88.93) | 25.70 (24.91)
SD-2-Inp-RCA               | 15.39 (28.03) | 21.98 (23.78) | 14.59 (14.24) | 73.24 (75.83) | 88.83 (88.85) | 25.81 (25.04)
SD-2-Inp-FineTuned         | 15.49 (27.83) | 22.06 (23.96) | 14.44 (14.06) | 74.64 (77.68) | 89.05 (89.04) | 26.31 (25.40)
SD-2-Inp-RCA-FineTuned     | 15.32 (27.45) | 22.00 (23.74) | 14.46 (14.13) | 74.30 (77.21) | 89.28 (89.35) | 26.72 (25.93)
SD-2-Inp-RCA-FineTuned-Gen | 15.30 (27.94) | 22.69 (24.42) | 14.05 (13.64) | 72.80 (76.01) | 87.47 (87.68) | 23.25 (22.94)

And we provide the commands to reproduce them:

  1. SD-2-Inp-HQPrompt

    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp_hqprompt.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp_hqprompt.yaml
  2. SD-2-Inp

    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp.yaml
  3. SD-2-Inp-RCA

    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp_rca.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp_rca.yaml
  4. SD-2-Inp-FineTuned

    • Train

      accelerate launch -m inpainting.models.text_to_image.train_text_to_image_lora main --config-path=models/configs/text_to_image/sd_2_inp_finetuned.yaml
    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp_finetuned.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp_finetuned.yaml
  5. SD-2-Inp-RCA-FineTuned

    Preliminary: to perform this experiment, you need to clone the stabilityai/stable-diffusion-2-inpainting repository and edit the model files as explained in the Setup Multi-Mask Inpainting with RCA section.

    • Train

      accelerate launch -m inpainting.models.text_to_image.train_text_to_image_lora main --config-path=models/configs/text_to_image/sd_2_inp_rca_finetuned.yaml
    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp_rca_finetuned.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp_rca_finetuned.yaml
  6. SD-2-Inp-RCA-FineTuned-Gen

    • Train: not needed; the model is the same as in the previous experiment.

    • Generate prompts

      python -m inpainting.models.text_to_image.test generate-prompts --config-path models/configs/text_to_image/test/sd_2_inp_rca_finetuned_gen.yaml
    • Generate

      python -m inpainting.models.text_to_image.test generate --config-path models/configs/text_to_image/test/sd_2_inp_rca_finetuned_gen.yaml
    • Compute

      python -m inpainting.models.text_to_image.test compute --config-path models/configs/text_to_image/test/sd_2_inp_rca_finetuned_gen.yaml

Citation

If you make use of our work, please cite our paper:

@inproceedings{fanelli2025idream,
  title     = {I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting},
  author    = {Fanelli, Nicola and Vessio, Gennaro and Castellano, Giovanna},
  year      = {2025},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision}
}
