Authors: Rishabh Srivastava (rs4489@columbia.edu), Addrish Roy (ar4613@columbia.edu)
This project demonstrates the potential of integrating abstract art interpretation with text-to-image synthesis. Building on ControlNet, which adds diverse conditioning inputs to Stable Diffusion, we introduce a novel condition crafted from geometric primitives, enabling finer control over image generation and paving the way for more nuanced artistic expression.
A detailed analysis is presented in this arXiv paper.
The dataset was curated from the 1% data sample file of the Wikipedia-based Image Text (WIT) Dataset. Captions for images were generated using the BLIP model. Control images were derived using the Primitive software, resulting in a dataset of 14,279 pairs of control and target images with captions.
Note: To create the dataset yourself, download the data sample zip file, unzip it, rename the resulting file to data.tsv, and then run the Python notebook create_dataset.ipynb to generate the dataset. You will also need the Primitive command-line tool installed, both to build a similar dataset and to generate the control image for an image of your own. Primitive converts an image into an abstract rendering composed of geometric primitives such as triangles.
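As a rough illustration of this pipeline, the sketch below captions a target image with BLIP and renders its geometric-primitive control image through the Primitive command line. The file paths, shape count, and triangle mode are illustrative assumptions, not the exact settings used to build the 14,279-pair dataset.

```python
# Minimal sketch: caption a target image with BLIP and abstract it with Primitive.
# Paths and Primitive settings (-n 100 shapes, -m 1 = triangles) are illustrative.
import subprocess
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    """Generate a caption for the target image with BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def make_control_image(image_path: str, out_path: str, shapes: int = 100) -> None:
    """Abstract the target image into triangles using the Primitive CLI."""
    subprocess.run(
        ["primitive", "-i", image_path, "-o", out_path, "-n", str(shapes), "-m", "1"],
        check=True,
    )

# Illustrative usage on one image pair.
print(caption("target/0001.png"))
make_control_image("target/0001.png", "source/0001.png")
```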
ControlNet integrates additional conditions into the main neural network block, allowing for fine control over the image generation process. The architecture involves locking the original block parameters and creating a trainable duplicate linked with zero convolution layers.
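As a minimal conceptual sketch of that idea (not the actual Stable Diffusion U-Net blocks), the snippet below wires a frozen block and its trainable duplicate together through zero-initialized 1x1 convolutions, so the combined network starts out identical to the original; class and layer names are illustrative.

```python
# Conceptual sketch of a ControlNet-style block: frozen original, trainable copy,
# and zero convolutions joining them. Shapes and names are illustrative.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so it contributes nothing at step 0."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, original_block: nn.Module, channels: int):
        super().__init__()
        self.trainable_copy = copy.deepcopy(original_block)  # trainable duplicate
        self.locked = original_block
        for p in self.locked.parameters():                   # lock the original parameters
            p.requires_grad = False
        self.zero_in = zero_conv(channels)                   # injects the extra condition
        self.zero_out = zero_conv(channels)                  # merges the copy's output back

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)
        y_ctrl = self.trainable_copy(x + self.zero_in(condition))
        return y + self.zero_out(y_ctrl)
```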
The model was trained on a subset of the dataset using an Nvidia T4 GPU. The training process involved initializing a pre-trained model, setting up a DataLoader, and iteratively processing image-text pairs to optimize the model's parameters.
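A hedged sketch of such a training setup, following the pattern of the official ControlNet tutorial script (tutorial_train.py). The initial checkpoint name, learning rate, and batch size are illustrative, and MyDataset refers to a dataset class like the one sketched after the folder layout below (the module name geometric_dataset is hypothetical).

```python
# Training sketch modeled on ControlNet's tutorial_train.py; hyperparameters are illustrative.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from cldm.model import create_model, load_state_dict
from geometric_dataset import MyDataset  # hypothetical module; see the dataset sketch below

# Initialize Stable Diffusion + ControlNet from an SD 1.5 checkpoint prepared
# with ControlNet's tool_add_control.py (checkpoint name assumed).
model = create_model("./models/cldm_v15.yaml").cpu()
model.load_state_dict(load_state_dict("./models/control_sd15_ini.ckpt", location="cpu"))
model.learning_rate = 1e-5
model.sd_locked = True          # keep the original Stable Diffusion weights frozen
model.only_mid_control = False

# Iterate over (control image, target image, caption) triples.
dataloader = DataLoader(MyDataset(), num_workers=2, batch_size=4, shuffle=True)

trainer = pl.Trainer(gpus=1, precision=32)   # single GPU, e.g. an Nvidia T4
trainer.fit(model, dataloader)
```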
The training dataset can be found here. Make sure the folder structure looks like this -
training
└── geometricShapes14k
    ├── prompt.json
    ├── source
    └── target
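A minimal dataset class for this layout, modeled on the official ControlNet tutorial_dataset.py. The root path and the prompt.json keys ("source", "target", "prompt") are assumptions based on that tutorial's one-JSON-object-per-line format.

```python
# Dataset sketch for the folder layout above; keys and paths are assumptions.
import json
import cv2
import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, root: str = "training/geometricShapes14k"):
        self.root = root
        self.data = []
        with open(f"{root}/prompt.json", "rt") as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        source = cv2.imread(f"{self.root}/{item['source']}")   # abstract control image
        target = cv2.imread(f"{self.root}/{item['target']}")   # original target image
        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0         # control in [0, 1]
        target = (cv2.cvtColor(target, cv2.COLOR_BGR2RGB).astype(np.float32) / 127.5) - 1.0  # target in [-1, 1]
        return dict(jpg=target, txt=item["prompt"], hint=source)
```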
Once trained, the model generates new images from a sample abstract image and a prompt. It can also run without a prompt, but performance is best when one is provided.
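A hedged inference sketch following the sampling pattern of the official ControlNet Gradio demos; the input image, prompt, and sampler settings are illustrative, and the checkpoint name matches the Hugging Face release mentioned below.

```python
# Inference sketch in the style of the official ControlNet demos; settings are illustrative.
import cv2
import einops
import numpy as np
import torch
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler

model = create_model("./models/cldm_v15.yaml").cpu()
model.load_state_dict(load_state_dict("./models/control_sd15_ini10_10__myds.pth", location="cuda"))
model = model.cuda()
sampler = DDIMSampler(model)

# Load the abstract (geometric-primitive) image as the control signal.
control = cv2.cvtColor(cv2.imread("my_abstract_input.png"), cv2.COLOR_BGR2RGB)
control = cv2.resize(control, (512, 512)).astype(np.float32) / 255.0
control = einops.rearrange(torch.from_numpy(control).unsqueeze(0), "b h w c -> b c h w").cuda()

prompt = "a lighthouse on a rocky coast at sunset"   # illustrative prompt

with torch.no_grad():
    cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt])]}
    un_cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([""])]}
    samples, _ = sampler.sample(
        50, 1, (4, 64, 64), cond, verbose=False, eta=0.0,
        unconditional_guidance_scale=9.0, unconditional_conditioning=un_cond,
    )
    result = model.decode_first_stage(samples)

result = (einops.rearrange(result, "b c h w -> b h w c") * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)
cv2.imwrite("output.png", cv2.cvtColor(result[0], cv2.COLOR_RGB2BGR))
```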
The final model can be found on Hugging Face - control_sd15_ini10_10__myds.pth.
The model was trained on more than 14,000 images and exhibited the sudden convergence phenomenon, in which it abruptly begins to adhere closely to the input condition. The results showcase diverse interpretations of the same abstract image based on different prompts.
The ControlNet model effectively preserved object locations and aligned with prompts, although it struggled with color replication due to limited training.
An example of how the same abstract image can be interpreted differently.

Future work will involve enhancing the dataset and exploring quantitative evaluations of the synthesized images.