This is my attempt at implementing Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer.
I got interested in the possibility of manipulating pretrained text2img models for video editing. I googled CLIP-based style transfer and stumbled upon this paper, which didn't have an open implementation, so I decided to write one myself.
Clone the submodules:

```bash
git clone https://github.com/openai/CLIP
git clone https://github.com/ouhenio/guided-diffusion.git
```
Install the submodules' dependencies:

```bash
pip install -e ./CLIP && pip install -e ./guided-diffusion
```
Download the unconditional diffusion model weights (2.06 GB):

```bash
wget -O unconditional_diffusion.pt https://openaipublic.blob.core.windows.net/diffusion/jul-2021/256x256_diffusion_uncond.pt
```
Sadly, the usage interface is pretty lacking:

```bash
python main.py
```
To try different styles, hyperparameters, and images, edit these lines in `main.py` (a sketch of what these edits might look like follows the list):

- line 139: guidance prompt
- line 216: loss hyperparameters
- line 155: initial image
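As a rough guide, here is what those edits could look like. This is only an illustrative sketch: the variable names and exact line positions are assumptions, not the actual identifiers in `main.py`, and the numbers are the cubism settings from the table below.

```python
# Illustrative sketch only: variable names are assumptions, not the
# actual identifiers in main.py.

# ~line 139: text prompt that guides the style transfer
prompt = "cubism"

# ~line 155: path to the image whose content should be preserved
init_image_path = "images/portrait.png"  # hypothetical path

# ~line 216: weights of the individual guidance losses
loss_weights = {
    "global": 20000,       # global CLIP loss
    "directional": 15000,  # directional CLIP loss
    "feature": 50,         # feature (content) loss
    "mse": 3000,           # pixel-wise MSE loss
    "zecon": 10,           # ZeCon patch-wise contrastive loss
}
```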
| Image | Prompt | Global Loss | Directional Loss | Feature Loss | MSE Loss | ZeCon Loss |
|---|---|---|---|---|---|---|
| portrait | None | None | None | None | None | None |
| | cubism | 20000 | 15000 | 50 | 3000 | 10 |
| | 3d render in the style of Pixar | 5000 | 5000 | 100 | 10000 | 500 |
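For context, these hyperparameters are the weights of a weighted sum over the individual guidance terms (global CLIP, directional CLIP, feature, MSE, and ZeCon). Below is a minimal sketch of how such a total guidance loss could be assembled, assuming the per-term losses have already been computed as scalar tensors; the function is my own illustration, not the code in `main.py`.

```python
import torch

def total_guidance_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Combine precomputed guidance terms into a single scalar.

    `losses` maps each term name to an already-computed scalar tensor
    (global CLIP, directional CLIP, feature, MSE, ZeCon); `weights`
    maps the same names to the hyperparameters from the table above.
    """
    return sum(weights[name] * losses[name] for name in weights)

# Example with the cubism settings from the table, using dummy loss values:
weights = {"global": 20000, "directional": 15000, "feature": 50, "mse": 3000, "zecon": 10}
losses = {name: torch.rand(()) for name in weights}  # placeholders for the real terms
print(total_guidance_loss(losses, weights))
```

The gradient of this sum with respect to the current sample is what nudges each denoising step toward the prompt, which is why the relative scale of the weights matters so much.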
I've found that this method kind of works, but it is very sensitive to its hyperparameters, which makes it frustrating to use.
Table 5 of the paper makes me confident that the authors had the same issue.