- Cartoons are a popular art form that has been applied in diverse settings, from printed publications to storytelling for children. Much cartoon artwork is created from real-world scenes, but manually re-creating such scenes is laborious and demands refined artistic skill.
Advances in machine learning have expanded the possibilities for creating visual art. Several well-known products turn real-world photographs into usable cartoon scene material, a process called image cartoonization.
White-box cartoonization is a method that reconstructs high-quality real-life photographs into convincing cartoon images using the GAN framework.
- INTRODUCTION TO GENERATIVE ADVERSARIAL NETWORKS (GANs)
- Generative adversarial networks (GANs) are an exciting recent innovation in machine learning. GANs are generative models: they create new data instances that resemble your training data. For example, GANs can create images that look like photographs of human faces, even though the faces don't belong to any real person.
- Generative Adversarial Networks, or GANs for short, are an approach to generative modelling using deep learning methods, such as convolutional neural networks. Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
- GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator, which we train to generate new examples, and the discriminator, which tries to classify examples as either real (from the domain) or fake (generated). The two models are trained together in an adversarial, zero-sum game until the discriminator is fooled about half the time, meaning the generator is producing plausible examples.
- GANs are an exciting and rapidly changing field, delivering on the promise of generative models in their ability to generate realistic examples across a range of problem domains, most notably in image-to-image translation tasks such as translating photos of summer to winter or day to night, and in generating photorealistic photos of objects, scenes, and people that even humans cannot tell are fake.
- DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS (DCGAN)
- The deep convolutional generative adversarial network, or DCGAN for short, is an extension of the GAN architecture that uses deep convolutional neural networks for both the generator and the discriminator, together with model and training configurations that result in stable training of the generator.
- The DCGAN is important because it suggested the constraints on the model required to effectively develop high-quality generator models in practice. This architecture, in turn, provided the basis for the rapid development of a large number of GAN extensions and applications.
- ARCHITECTURE OF GAN
- The architecture of a GAN has two basic elements: the generator network and the discriminator network. Each network can be any neural network, such as an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short-Term Memory (LSTM) network. The discriminator typically ends in fully connected layers that act as a classifier.
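- As an illustrative sketch only (not the architecture used in this work), a small DCGAN-style generator and discriminator could be written in PyTorch as follows; the layer sizes and the 32x32 output resolution are arbitrary assumptions:

```python
# Sketch: minimal DCGAN-style generator and discriminator in PyTorch.
# Layer sizes and the 32x32 output resolution are illustrative assumptions,
# not the architecture used for white-box cartoonization.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),   nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),    nn.BatchNorm2d(64),  nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),      nn.Tanh(),  # 3x32x32 image in [-1, 1]
        )

    def forward(self, z):                 # z: (batch, z_dim, 1, 1) random noise
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1),    nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1),  nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 4, 1, 0),   # single real/fake logit per image
        )

    def forward(self, x):                 # x: (batch, 3, 32, 32) image
        return self.net(x).view(-1)
```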
- A generative adversarial network (GAN) has two parts:
- The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.
- The discriminator learns to distinguish the generator's fake data from real data. The discriminator penalizes the generator for producing implausible results.
- When training begins, the generator produces fake data, and the discriminator quickly learns to tell that it's fake:
- As training progresses, the generator gets closer to producing output that can fool the discriminator:
- Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.
- Figure 1 shows a picture of the whole system:
- Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
- THE DISCRIMINATOR
- The discriminator in a GAN is simply a classifier. It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it's classifying.
- DISCRIMINATOR TRAINING DATA: The discriminator's training data comes from two sources:
- Real data instances, such as real pictures of people. The discriminator uses these instances as positive examples during training.
- Fake data instances created by the generator. The discriminator uses these instances as negative examples during training.
- TRAINING THE DISCRIMINATOR
- The discriminator connects to two loss functions. During discriminator
training, the discriminator ignores the generator loss and just uses the
discriminator loss. We use the generator loss during generator training, as
described in the next section.
During discriminator training:
- The discriminator classifies both real data and fake data from the generator.
- The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.
- The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator network.
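- A minimal sketch of one discriminator update in PyTorch, assuming the toy models above and binary cross-entropy on the discriminator's logits:

```python
# Sketch: one discriminator update step (PyTorch, binary cross-entropy on logits).
# `generator`, `discriminator`, `d_optimizer` and `real_images` are assumed to exist.
import torch
import torch.nn.functional as F

def discriminator_step(generator, discriminator, d_optimizer, real_images, z_dim=100):
    d_optimizer.zero_grad()

    # Real data: positive examples, labelled 1.
    real_logits = discriminator(real_images)
    real_loss = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))

    # Fake data from the generator: negative examples, labelled 0.
    # detach() keeps this step from computing gradients for the generator.
    z = torch.randn(real_images.size(0), z_dim, 1, 1)
    fake_images = generator(z).detach()
    fake_logits = discriminator(fake_images)
    fake_loss = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))

    # The discriminator loss penalizes misclassifying real as fake or fake as real.
    d_loss = real_loss + fake_loss
    d_loss.backward()          # backpropagate through the discriminator only
    d_optimizer.step()         # update discriminator weights
    return d_loss.item()
```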
- THE GENERATOR
- The generator part of a GAN learns to create fake data by incorporating
feedback from the discriminator. It learns to make the discriminator classify its
output as real.
Generator training requires tighter integration between the generator and the discriminator than discriminator training requires. The portion of the GAN that trains the generator includes:
- random input
- the generator network, which transforms the random input into a data instance
- the discriminator network, which classifies the generated data
- the discriminator output
- the generator loss, which penalizes the generator for failing to fool the discriminator
- USING THE DISCRIMINATOR TO TRAIN THE GENERATOR
- Sample random noise.
- Produce generator output from sampled random noise.
- Get discriminator "Real" or "Fake" classification for generator output.
- Calculate loss from discriminator classification.
- Backpropagate through both the discriminator and generator to obtain gradients.
- Use gradients to change only the generator weights.
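- A minimal sketch of these six steps in PyTorch (same assumed toy models as above); gradients flow back through the discriminator, but only the generator's optimizer takes a step, so only its weights change:

```python
# Sketch: one generator update step. Gradients flow back through the discriminator,
# but only the generator's weights are updated because only g_optimizer steps.
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, g_optimizer, batch_size, z_dim=100):
    g_optimizer.zero_grad()

    # 1-2. Sample random noise and produce generator output from it.
    z = torch.randn(batch_size, z_dim, 1, 1)
    fake_images = generator(z)

    # 3. Get the discriminator's real/fake classification for the generated images.
    fake_logits = discriminator(fake_images)

    # 4. Generator loss: penalize failing to fool the discriminator (target label 1).
    g_loss = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    # 5. Backpropagate through both discriminator and generator to obtain gradients.
    g_loss.backward()

    # 6. Use the gradients to change only the generator weights.
    g_optimizer.step()
    return g_loss.item()
```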
- The surface representation contains a smooth surface of cartoon images.
- The structure representation refers to the sparse colour blocks and flattened global content in the celluloid-style workflow.
- The texture representation reflects high-frequency texture, contours, and details in cartoon images.
- A Generative Adversarial Network (GAN) framework is used to learn the extracted representations and to cartoonize images.
- The surface representation imitates the cartoon painting style in which artists roughly draw drafts with coarse brushes, producing smooth surfaces similar to cartoon images.
- To smooth images while keeping the global semantic structure, a differentiable guided filter is adopted for edge-preserving filtering.
- Edge-preserving filtering is an image processing technique that smooths away noise or textures while retaining sharp edges. Examples are the median, bilateral, guided, and anisotropic diffusion filters.
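- As a minimal sketch of this step, assuming opencv-contrib-python (which provides cv2.ximgproc.guidedFilter) and illustrative radius/eps values rather than the authors' exact settings:

```python
# Sketch: extracting a smooth "surface" image with an edge-preserving guided filter.
# Assumes opencv-contrib-python is installed (cv2.ximgproc); radius and eps are
# illustrative values, not the authors' exact settings.
import cv2

def extract_surface(image_path, radius=5, eps=1e-2):
    img = cv2.imread(image_path)
    img = img.astype("float32") / 255.0
    # Use the image itself as the guide map, so strong edges in the input are
    # preserved while fine textures and details are smoothed away.
    surface = cv2.ximgproc.guidedFilter(guide=img, src=img, radius=radius, eps=eps)
    return (surface * 255).clip(0, 255).astype("uint8")
```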
- Note: A discriminator Ds is introduced to judge whether model outputs and reference cartoon images have similar surfaces, and guide generator G to learn the information stored in the extracted surface representation.
- We first use the Felzenszwalb algorithm to segment images into separate regions. As superpixel algorithms only consider the similarity of pixels and ignore semantic information, we further introduce selective search to merge segmented regions and extract a sparse segmentation map.
- Standard superpixel algorithms colour each segmented region with the average of its pixel values. Analysing the processed dataset, we found this lowers global contrast, darkens images, and causes a hazing effect in the final results. We thus propose an adaptive colouring algorithm; a code sketch follows the formula below.
- Adaptive colouring formula (weight selection):
(θ1, θ2) = (0, 1) if σ(S) < γ1
(θ1, θ2) = (0.5, 0.5) if γ1 < σ(S) < γ2
(θ1, θ2) = (1, 0) if γ2 < σ(S)
- Where, G = Generator, Ip = Input Photo, and Fst = structure representation extraction.
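- A minimal sketch of this step, assuming scikit-image's felzenszwalb and interpreting the weights above as a per-region blend of the segment's mean and median colour with an exponent µ (that interpretation, and the omission of selective-search merging, are assumptions to check against the original paper):

```python
# Sketch: structure representation via Felzenszwalb segmentation + adaptive colouring.
# Assumes scikit-image and numpy. The blend of segment mean and median, and the use
# of the exponent mu, are my reading of the formula fragment above; selective-search
# region merging is omitted for brevity.
import numpy as np
from skimage.segmentation import felzenszwalb

GAMMA1, GAMMA2, MU = 20.0, 40.0, 1.2  # values the text reports as working well

def adaptive_colour(image):
    """image: uint8 RGB array of shape (H, W, 3); returns a colour-block image."""
    segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
    out = np.zeros_like(image, dtype=np.float32)
    for label in np.unique(segments):
        region = image[segments == label].astype(np.float32)
        sigma_s = region.std()                      # sigma(S): spread of the region
        if sigma_s < GAMMA1:
            theta1, theta2 = 0.0, 1.0
        elif sigma_s < GAMMA2:
            theta1, theta2 = 0.5, 0.5
        else:
            theta1, theta2 = 1.0, 0.0
        blended = theta1 * region.mean(axis=0) + theta2 * np.median(region, axis=0)
        # Apply the exponent on normalised values (assumption), then rescale.
        out[segments == label] = 255.0 * (blended / 255.0) ** MU
    return out.clip(0, 255).astype(np.uint8)
```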
- The high-frequency features of cartoon images are key learning objectives, but luminance and colour information make it easy to distinguish between cartoon images and real-world photos. We thus propose a random colour shift algorithm. The random colour shift can generate random intensity maps with luminance and colour information removed.
- Frcs extracts a single-channel texture representation from colour images, which retains high-frequency textures and decreases the influence of colour and luminance.
- Note: We set α = 0.8, β1, β2 and β3 ∼ U(−1, 1).
- Where,
G = Generator, Dt = Discriminator, Ic = Reference Cartoon Image, Ip = Input Photo, and Frcs = single-channel texture-representation extraction from colour images, which retains high-frequency textures and decreases the influence of colour and luminance.
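- A minimal sketch of the random colour shift, assuming Frcs mixes a random re-weighting of the R, G and B channels with the grayscale image Y using the α and β values quoted above (the exact combination form is an assumption to verify against the original paper):

```python
# Sketch: random colour shift Frcs producing a single-channel intensity map.
# The combination form is an assumption: a random re-weighting of the R, G, B
# channels mixed with the grayscale image Y, using alpha = 0.8 and
# beta1, beta2, beta3 ~ U(-1, 1) as stated in the text.
import numpy as np

def random_colour_shift(image, alpha=0.8):
    """image: float RGB array in [0, 1] of shape (H, W, 3); returns (H, W)."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # standard grayscale conversion
    beta1, beta2, beta3 = np.random.uniform(-1.0, 1.0, size=3)
    mixed = (1.0 - alpha) * (beta1 * r + beta2 * g + beta3 * b) + alpha * y
    # Normalise back to [0, 1] so the map can be fed to the texture discriminator.
    mixed = (mixed - mixed.min()) / (mixed.max() - mixed.min() + 1e-8)
    return mixed
```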
In Figure 1, the two "Sample" boxes represent these two data sources (real instances and generated instances) feeding into the discriminator. During discriminator training, the generator does not train: its weights remain constant while it produces examples for the discriminator to train on.
- To train a neural net, we alter the net's weights to reduce the error or loss of its output. In our GAN, however, the generator is not directly connected to the loss that we're trying to affect. The generator feeds into the discriminator net, and the discriminator produces the output we're trying to affect. The generator loss penalizes the generator for producing a sample that the discriminator network classifies as fake.
Backpropagation adjusts each weight in the right direction by calculating the weight's impact on the output, that is, how the output would change if you changed the weight. But the impact of a generator weight depends on the impact of the discriminator weights it feeds into. So backpropagation starts at the output and flows back through the discriminator into the generator.
At the same time, we don't want the discriminator to change during generator training. Trying to hit a moving target would make a hard problem even harder for the generator. So we train the generator with the procedure listed above under USING THE DISCRIMINATOR TO TRAIN THE GENERATOR.
- We propose to separately identify three white-box representations from
images:
1. The surface representation
2. The structure representation
3. The texture representation
Where,
G = Generator,
Ds = Discriminator,
Ic = Reference Cartoon Image,
Ip = Input Photo,
Fdgf = the differentiable guided filter, which takes an image I as input and as its own guide map, and returns the extracted surface representation Fdgf(I, I) with textures and details removed.
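For reference, the surface adversarial loss that this notation appears to accompany, reconstructed here from the definitions above and the standard GAN loss form (an assumption to be checked against the original paper), would read:

```latex
\mathcal{L}_{surface}(G, D_s) =
  \log D_s\!\left(F_{dgf}(I_c, I_c)\right)
  + \log\!\left(1 - D_s\!\left(F_{dgf}\big(G(I_p), G(I_p)\big)\right)\right)
```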
For the adaptive colouring weights, we find that γ1 = 20, γ2 = 40 and µ = 1.2 generate good results.
- Note: We use high-level features extracted by a pre-trained VGG16 network to enforce spatial constraints between our results and the extracted structure representation.
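- Reconstructed from this note and the notation above (an assumption to be verified against the original paper), the structure loss would compare VGG16 features of the generator output and of its extracted structure representation:

```latex
\mathcal{L}_{structure} =
  \left\lVert \mathrm{VGG}_n\!\left(G(I_p)\right)
  - \mathrm{VGG}_n\!\left(F_{st}\big(G(I_p)\big)\right) \right\rVert
```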
- Where, Irgb represents a 3-channel RGB colour image, Ir, Ig and Ib represent its three colour channels, and Y represents the standard grayscale image converted from the RGB colour image.