In order to toonify real face images we leverage modified copies (no git submodules) of several existing repositories:
- https://github.com/justinpinkney/stylegan2 (commit dbf69a9) by Justin Pinkney
- https://github.com/dvschultz/stylegan2-ada-pytorch (commit 9b6750b) by Derrick Schultz
- https://github.com/eladrich/pixel2style2pixel (commit 334f45e) by Elad Richardson
and leverage several pre-trained models:
- StyleGAN2 FFHQ 1024
- StyleGAN2 FFHQ 1024 config F
- VGG16 Feature Extractor
- MMOD Human Face Detector
- Shape Predictor 68 Face Landmarks
- IR-SE50
The toonification pipeline consists of four steps:
1. Data Preparation
2. Finetuning and Blending the FFHQ Model
3. Projection of Input Images into the Latent Space
4. Image-to-Image Translation GAN
We use a collection of about 1000 Disney/Pixar-style cartoon face images, collected with a web scraper and a custom web tool for image management and image cropping.
Store the cartoon face images in a folder `./faces`.
conda env create -f environment.yml
conda activate toon
As we are going to apply transfer learning on the FFHQ_1024 model, we have to resize all images to 1024×1024 and align all cartoon images so that their face-keypoint positions are similar to those of the face images in the Flickr-Faces-HQ dataset used to train the FFHQ model.
In order to find face keypoints, we first need to detect the cartoon faces.
The dlib library provides two functions that can be used for cartoon face detection:
- HOG + Linear SVM face detector (fast but less accurate)
- MMOD CNN face detector (slow but more accurate)
We detect 68 face landmarks using the pretrained landmark detector model `shape_predictor_68_face_landmarks` from dlib.
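A minimal sketch of this detection and landmark step with dlib's Python API (the image path is a placeholder; it assumes the `mmod_human_face_detector.dat` and `shape_predictor_68_face_landmarks.dat` model files have been downloaded):

```python
import dlib
import numpy as np
from PIL import Image

# load a cartoon image as an RGB numpy array (placeholder path)
img = np.array(Image.open('faces/cartoon_001.png').convert('RGB'))

# HOG + Linear SVM detector (fast) and MMOD CNN detector (accurate)
hog_detector = dlib.get_frontal_face_detector()
mmod_detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')
landmark_predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

# the MMOD detector returns mmod_rectangle objects; the bounding box is in .rect
detections = mmod_detector(img, 1)            # 1 = upsample the image once
for detection in detections:
    shape = landmark_predictor(img, detection.rect)
    landmarks = np.array([[p.x, p.y] for p in shape.parts()])  # 68 (x, y) points
    print(landmarks.shape)                    # -> (68, 2)
```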
To get an idea of what landmark detection looks like, run
> python test_face_detection.py --detector hog
> python test_face_detection.py --detector mmod
For aligning the cartoon images similarly to FFHQ, we leverage the `face_alignment.py` script from the stylegan2 repository:
> python align_image_data.py --help
> python align_image_data.py --images-source /path/to/images --images-target dataset
The `align_image_data.py` script uses the MMOD face detection model by default, as it detects more cartoon faces.
In order to visually verify and clean up properly aligned cartoon faces (the `remove` button deletes images from the target folder), open a Jupyter notebook and run
from utils import print_images
print_images(source_folder='/path/to/images', target_folder='dataset', image_filter='aligned', landmarks=True)
To visually verify which cartoon faces could not be aligned or which were deleted manually, run
from utils import print_images
print_images(source_folder='/path/to/images', target_folder='dataset', image_filter='missed')
The `stylegan2-ada-pytorch` repository allows us to finetune the pre-trained FFHQ model using transfer learning.
The FFHQ model was created using StyleGAN2, an improved generative adversarial network (GAN) published by Nvidia in 2020.
StyleGAN2-ADA is a further improved GAN which leverages adaptive discriminator augmentation (ADA) to prevent the discriminator from overfitting on small datasets.
StyleGAN2-ADA requires a GPU, which is available via Google Colab.
Good practice is to store all datasets and training results on Google Drive.
First, mount Google Drive and create a project folder with
from google.colab import drive
drive.mount('/content/drive')
project = '/content/drive/MyDrive/toonification'
%mkdir -p {project}
Then zip the image dataset and upload it to a Google Drive folder:
> gdrive list -q "name contains 'toonification' and mimeType contains 'folder'"
> gdrive upload -p <folderId> dataset.zip
Then clone the repository and install the dependencies:
!git clone https://github.com/thomd/stylegan2-toonification.git
!pip install --quiet ninja opensimplex torch==1.7.1 torchvision==0.8.2
!nvidia-smi
%cd stylegan2-toonification/stylegan2-ada-pytorch/
Start Tensorboard for tracking metrics (in Firefox you might need to deactivate "Enhanced Tracking Protection"):
%load_ext tensorboard
%tensorboard --logdir={project}/results/
Google Colab notebooks have an idle timeout of 90 minutes and an absolute timeout of 12 hours. This means that if you do not interact with your notebook for more than 90 minutes, the instance is automatically terminated. In addition, the maximum lifetime of a Colab instance is 12 hours.
Therefore it is good practice to save snapshots very often (setting `snap = 1`) and to resume from the last snapshot after 12 hours.
Start fine-tuning the FFHQ model:
resume = 'ffhq1024'
nkimg = 0
aug = 'ada'
augpipe = 'bgcf'
freezed = 0
gamma = 10
target = 0.6
!python train.py \
--cfg='11gb-gpu' \
--outdir={project}/results \
--data={project}/dataset.zip \
--snap=1 \
--resume={resume} \
--freezed={freezed} \
--aug={aug} \
--augpipe={augpipe} \
--gamma={gamma} \
--target={target} \
--mirror=True \
--nkimg={nkimg}
When resuming from a snapshot, set `resume` and `nkimg` accordingly and start the training script again:
resume = f'{project}/results/.../network-snapshot-000xxx.pkl'
nkimg = xxx
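Since the snapshot filename encodes the kimg count, a small helper can pick the latest snapshot and the matching `nkimg` automatically. A sketch, assuming the default results layout written by `train.py`:

```python
import glob
import os
import re

# locate the newest snapshot pickle under the results folder
snapshots = sorted(glob.glob(f'{project}/results/*/network-snapshot-*.pkl'),
                   key=os.path.getmtime)
resume = snapshots[-1]

# 'network-snapshot-000120.pkl' -> 120 kimg already trained
nkimg = int(re.search(r'network-snapshot-(\d+)\.pkl', resume).group(1))
print(resume, nkimg)
```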
Stop training as soon as the losses in Tensorboard reach a plateau. Then try variations of `freezed`, `gamma`, `aug`, `augpipe` and `target` and keep the results for later tests with real face images.
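A simple way to organize these variations is a small grid over candidate values. A hypothetical sketch that only prints the resulting training commands (the value lists are examples, not recommendations):

```python
from itertools import product

# example candidate values to sweep; adjust to your own experiments
freezed_values = [0, 2]
gamma_values = [10, 50]
augpipe_values = ['bgcf', 'bgc']

for freezed, gamma, augpipe in product(freezed_values, gamma_values, augpipe_values):
    cmd = (
        f"python train.py --cfg=11gb-gpu --outdir={project}/results "
        f"--data={project}/dataset.zip --snap=1 --resume=ffhq1024 "
        f"--freezed={freezed} --aug=ada --augpipe={augpipe} "
        f"--gamma={gamma} --target=0.6 --mirror=True --nkimg=0"
    )
    print(cmd)  # run each command in its own Colab session
```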
In order to determine the best hyper-parameters and a sufficient training time, use the model snapshots to generate new cartoon face images from latent vectors drawn from a random normal distribution.
The hyper-parameters are part of the `results` folder name, which allows selecting the parametrized model snapshots.
The `truncation` parameter truncates the probability distribution of the latent space vector. It controls the trade-off between diversity and fidelity: the smaller the value, the more realistic (but less diverse) the generated images. Typical values are between 0.5 and 1.0.
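Conceptually, truncation pulls a sampled latent toward the average latent of the mapping network. A minimal numpy sketch with dummy vectors for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
w_avg = rng.normal(size=512)     # stands in for the generator's average latent
w = rng.normal(size=512)         # stands in for a sampled latent
truncation = 0.5

# psi < 1 moves w toward w_avg: less diversity, more "typical" (realistic) faces
w_truncated = w_avg + truncation * (w - w_avg)
```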
seeds = '42-47'
truncation = 0.5
model = f'results/{num}-dataset-mirror-11gb-gpu-gamma{gamma}-ada-target{target}-{augpipe}-resume{resume}-freezed{freezed}/network-snapshot-{snapshot}.pkl'
%mkdir -p {project}/out
!python generate.py \
--outdir={project}/out/ \
--trunc={truncation} \
--seeds={seeds} \
--network={project}/{model}
from PIL import Image
import os
import numpy as np

# load the generated images, downscale them and display them side by side
images = []
for image in sorted(os.listdir(f'{project}/out/')):
    images.append(np.array(Image.open(f'{project}/out/' + image).resize((256, 256))))
print(f'\nFreezeD: {freezed}, Gamma: {gamma}, AugPipe: {augpipe}')
display(Image.fromarray(np.concatenate(images, axis=1)))
You can take two completely different models and combine them by splitting them at a specific resolution, taking the lower layers (pose, eyes, ...) of one model and the higher layers (texture) of the other. Model blending (aka layer swapping) tends to work best when one of the models is transfer-learned from the other.
We got the best results when using the lower layers from the cartoon-faces model, the higher layers from the FFHQ model and a split resolution of 128.
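Conceptually, layer swapping builds a new generator by taking the synthesis blocks below the split resolution from one model and the blocks at or above it from the other. A rough sketch of the idea, assuming StyleGAN2-ADA-style parameter names such as `synthesis.b128.conv0.weight` (the actual copying is handled by the provided `blend_models.py` script):

```python
import copy

def blend_state_dicts(cartoon_sd, ffhq_sd, split_res=128):
    """Layer swapping sketch: low-resolution blocks (pose, layout) from the
    cartoon model, high-resolution blocks (texture) from the FFHQ model."""
    blended = copy.deepcopy(cartoon_sd)
    for key, value in ffhq_sd.items():
        if key.startswith('synthesis.b'):
            res = int(key.split('.')[1][1:])   # 'synthesis.b256.conv0.weight' -> 256
            if res >= split_res:
                blended[key] = value           # take high-res layers from FFHQ
    return blended
```

In practice, the provided `blend_models.py` script performs this copying between the two network pickles: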
%cd /content
ffhq_model = 'ffhq-res1024-mirror-stylegan2-noaug.pkl'
%mkdir -p {project}/blending
cartoon_model = f'results/{num}-dataset-mirror-11gb-gpu-gamma{gamma}-ada-target{target}-{augpipe}-resume{resume}-freezed{freezed}/network-snapshot-{snapshot}.pkl'
%cd stylegan2-toonification/stylegan2-ada-pytorch/
!python blend_models.py \
--lower_res_pkl {project}/{cartoon_model} \
--split_res 128 \
--higher_res_pkl {ffhq_model} \
--output_path {project}/blending/cartoon_ffhq_blended_128.pkl
In order to toonify given input images of real faces, we need to project them to their latent vector representations using a variational autoencoder (VAE). This process is also known as StyleGAN inversion.
GANs learn to generate outputs from random latent vectors that mimic the appearance of the input data, but not necessarily the exact samples of the input data. VAEs learn to encode input samples into latent vectors, and then also learn to decode latent vectors back to their (mostly) original form.
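In its simplest form, such a projection can also be done by direct optimization: start from the average latent of the mapping network and optimize it until the generator reproduces the target image. A simplified sketch, assuming a stylegan2-ada-pytorch generator `G` (the actual projector scripts additionally use a VGG16/LPIPS perceptual loss and noise regularization):

```python
import torch
import torch.nn.functional as F

def project(G, target, num_steps=500, lr=0.1, device='cuda'):
    """Optimization-based StyleGAN inversion sketch.
    target: image tensor of shape [1, 3, H, W] in the generator's value range."""
    with torch.no_grad():
        z = torch.randn(10000, G.z_dim, device=device)
        w_avg = G.mapping(z, None).mean(dim=0, keepdim=True)   # [1, num_ws, w_dim]
    w = w_avg.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        synth = G.synthesis(w, noise_mode='const')
        # pixel loss only, for brevity; add a perceptual term for usable results
        loss = F.mse_loss(synth, target) + 0.1 * ((w - w_avg) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```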
This works best if the input images are cropped and aligned similarly to the cartoon faces dataset:
%cd /content/stylegan2-toonification/
!python align_image_data.py --images-source input --images-target input_aligned
Besides the standard projection technique provided in Nvidia's stylegan2 repository, there is an additional projection technique by Peter Baylies provided in the stylegan2-ada-pytorch repository. We use the latter, as it produces better results:
%cd /content/stylegan2-toonification/stylegan2-ada-pytorch/
!python pbaylies_projector.py \
--outdir=/content/input \
--target-image=/content/input/raw_00.png \
--num-steps=500 \
--save-video=False \
--use-clip=False \
--use-center=False \
--network={ffhq_model}
The generated latent vector `projected_w.npz` is then used as input for our blended cartoon model `cartoon_ffhq_blended_128.pkl`:
%mkdir -p /content/output
!python generate.py \
--outdir=/content/output \
--projected-w=/content/input/projected_w.npz \
--trunc=0.5 \
--network={project}/blending/cartoon_ffhq_blended_128.pkl
As the projection of real faces into latent space takes very long (on the order of minutes for a single image), we train an image-to-image translation GAN. Image-to-image translation is the task of taking images from one domain and transforming them so they have the style (or characteristics) of images from another domain.
In order to train an image-to-image translation GAN, we need a large dataset of image pairs of real faces and their corresponding toonified faces.
We use the projection script `stylegan2-ada-pytorch/pbaylies_projector.py` and our custom `cartoon_ffhq_blended_128.pkl` model to create cartoon faces from a dataset of at least 3000 aligned real faces (see step 3).
If no data of real faces is available, use the StyleGAN2 FFHQ model to generate images of random "real" faces.
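To build the pairs, each aligned real face is projected once and the resulting latent is rendered with the blended generator. A hypothetical helper loop, assuming the same flags and model paths as in step 3 and run from the stylegan2-ada-pytorch folder (the input and output folder names are placeholders):

```python
import os
import subprocess

reals_dir = '/content/reals_aligned'   # aligned real faces (placeholder path)
toons_dir = '/content/toons'           # toonified counterparts (placeholder path)
os.makedirs(toons_dir, exist_ok=True)

for name in sorted(os.listdir(reals_dir)):
    stem = os.path.splitext(name)[0]
    # 1. project the real face into latent space using the FFHQ model
    subprocess.run(['python', 'pbaylies_projector.py',
                    f'--outdir={toons_dir}/{stem}',
                    f'--target-image={reals_dir}/{name}',
                    '--num-steps=500', '--save-video=False',
                    f'--network={ffhq_model}'], check=True)
    # 2. render the projected latent with the blended cartoon model
    subprocess.run(['python', 'generate.py',
                    f'--outdir={toons_dir}/{stem}',
                    f'--projected-w={toons_dir}/{stem}/projected_w.npz',
                    f'--network={project}/blending/cartoon_ffhq_blended_128.pkl'],
                   check=True)
```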
Prepare the image pairs using the following file structure:
dataset_psp
├── test
│ ├── reals
│ │ ├── 0001.png
│ │ └── 0002.png
│ └── toons
│ ├── 0001.png
│ └── 0002.png
└── train
├── reals
│ ├── 0003.png
│ └── 0004.png
└── toons
├── 0003.png
└── 0004.png
Then zip the dataset, upload it to Google Drive and unzip it into `/content` on Colab:
> zip -r dataset_psp.zip dataset_psp
> gdrive upload -p <folderId> dataset_psp.zip
%cd /content
!cp {project}/dataset_psp.zip .
!unzip -q dataset_psp.zip
In case of a different location or different folder names, update the paths in `./pixel2style2pixel/configs/paths_config.py`.
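For reference, the entries in `paths_config.py` are plain dictionary values. A hypothetical excerpt with illustrative key names (check `configs/data_configs.py` for the exact keys the `toonify` dataset type expects):

```python
# pixel2style2pixel/configs/paths_config.py (excerpt, key names are illustrative)
dataset_paths = {
    'toonify_train_source': '/content/dataset_psp/train/reals',
    'toonify_train_target': '/content/dataset_psp/train/toons',
    'toonify_test_source': '/content/dataset_psp/test/reals',
    'toonify_test_target': '/content/dataset_psp/test/toons',
}
```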
First, download the pretrained IR-SE50 model for ID loss during pSp training and the pretrained FFHQ StyleGAN model:
%cd /content/stylegan2-toonification/pixel2style2pixel/pretrained_models
!gdown 1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn
!gdown 1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT
Then train the cartoon model. We got the best results with these hyper-parameters:
%load_ext tensorboard
%tensorboard --logdir {project}/pixel2style2pixel
%cd /content/stylegan2-toonification/pixel2style2pixel
!python scripts/train.py \
--dataset_type='toonify' \
--exp_dir={project}/pixel2style2pixel \
--batch_size=4 \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=1 \
--w_norm_lambda=0.025
Having trained your model, apply it to an aligned real face image:
%cd /content/stylegan2-toonification
!python align_image_data.py --image-source=/content/psp/input/face.jpg
%cd /content/stylegan2-toonification/pixel2style2pixel
!python scripts/inference.py \
--exp_dir=/content/psp/output \
--checkpoint_path={project}/pixel2style2pixel/checkpoints/best_model.pt \
--data_path=/content/psp/input \
--couple_outputs \
--test_batch_size=1