- Title: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
- Authors: Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
- Link: https://arxiv.org/abs/1606.03657
- Tags: Neural Network, GAN
- Year: 2016
What
- Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution.
  - The effect of this is that the components in `z` are deeply entangled.
  - Changing single components has hardly any influence on the generated images. One has to change multiple components to affect the image.
  - The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color.
- They suggest a change to GANs based on mutual information, which leads to interpretable components.
  - E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...).
  - These components are learned in a (mostly) unsupervised fashion.
How
- The latent code `c`
  - "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image.
  - This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image.
  - `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a gaussian one for the stroke thickness. (A sketch of how such an input could be sampled follows below.)
- Mutual Information
  - If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just degenerate to `G(z)`.
  - To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x`, you must be able to reliably tell which latent code `c` it has, which means that G must use `c` in a meaningful way.
  - This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high.
    - The mutual information between two variables X and Y is defined as `I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)`, where `H(.)` denotes entropy. (A small numeric sanity check follows after this block.)
    - If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round).
    - If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round).
  - The new GAN loss becomes `old loss - lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the loss.
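As a sanity check of the definition, a small NumPy sketch (illustrative only, not from the paper) that computes `I(X; Y)` for two discrete variables from their joint distribution, using the equivalent identity `I(X; Y) = H(X) + H(Y) - H(X, Y)`:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a distribution, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), equivalent to H(X) - H(X|Y)."""
    p_x = p_xy.sum(axis=1)  # marginal of X
    p_y = p_xy.sum(axis=0)  # marginal of Y
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.flatten())

# X and Y perfectly correlated -> knowing Y fully determines X: I = H(X) = log(2)
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # ~0.693

# X and Y independent -> knowing Y tells you nothing about X: I = 0
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # ~0.0
```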
- Variational Mutual Information Maximization
  - In order to maximize `I(G(z, c); c)`, one has to know the posterior distribution `P(c|x)` (from image to latent code), which however is unknown.
  - So instead they create `Q(c|x)`, which is an approximation of `P(c|x)`.
  - `I(G(z, c); c)` is then maximized via a lower bound, similar to the one in variational autoencoders (the technique is called "Variational Information Maximization", hence the name "InfoGAN").
  - Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(c|x)] + H(c) <= I(G(z, c); c)`
    - `c` is the latent code.
    - `x` is the generated image.
    - `H(c)` is the entropy of the latent codes (constant throughout the optimization).
    - Optimization w.r.t. Q is done directly.
    - Optimization w.r.t. G is done via the reparameterization trick.
    - If `Q(c|x)` approximates `P(c|x)` perfectly, the lower bound becomes the mutual information ("the lower bound becomes tight").
  - In practice, `Q(c|x)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q. (A rough sketch of the resulting loss term follows below.)
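A rough PyTorch sketch of how the `-E[log Q(c|x)]` term could look in practice. The function and variable names are my own, not from the paper: Q outputs categorical logits plus Gaussian parameters for the continuous codes (computed from features shared with D), and the resulting negative log-likelihood is added to the generator/Q objective with weight `lambda`; `H(c)` is constant and dropped.

```python
import torch
import torch.nn.functional as F

def info_loss(q_cat_logits, q_cont_mean, q_cont_logstd, c_cat, c_cont, lam=1.0):
    """-E[log Q(c|x)], i.e. the negative of the variational lower bound
    (up to the constant H(c), which does not affect the gradients)."""
    # categorical code: cross-entropy = -log Q(c_cat | x)
    cat_nll = F.cross_entropy(q_cat_logits, c_cat)
    # continuous codes: Gaussian negative log-likelihood, up to an additive constant
    cont_nll = (q_cont_logstd
                + 0.5 * ((c_cont - q_cont_mean) / q_cont_logstd.exp()) ** 2).sum(dim=1).mean()
    return lam * (cat_nll + cont_nll)

# Example shapes for the paper's MNIST setting:
B, n_cat, n_cont = 64, 10, 2
c_cat = torch.randint(0, n_cat, (B,))   # categorical code that was fed into G
c_cont = torch.rand(B, n_cont) * 2 - 1  # continuous codes fed into G, uniform [-1, +1]
# Q's outputs would normally come from D's shared features applied to G(z, c);
# random tensors stand in for them here:
loss = info_loss(torch.randn(B, n_cat), torch.randn(B, n_cont),
                 torch.zeros(B, n_cont), c_cat, c_cont)
```

Minimizing this term together with the usual GAN loss maximizes the lower bound on the mutual information, which is exactly the `old loss - lambda * I(G(z, c); c)` objective from above.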
Results
- MNIST
  - They use for `c` one categorical variable (10 values) and two continuous ones (uniform between -1 and +1).
  - InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width.
  - Applying `Q(c|x)` to an image and then classifying based only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy.
  - Sampling new images with exaggerated continuous variables in the range `[-2, +2]` still yields plausible images, i.e. the network generalizes well. (A latent-traversal sketch follows below.)
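This kind of inspection can be reproduced with a simple latent traversal: keep `z` and all other codes fixed and sweep one continuous code over `[-2, +2]`. A minimal sketch, assuming a trained `generator` (hypothetical; any `G(z, c)` with the input layout from the earlier sampling sketch works):

```python
import numpy as np

# `generator` stands for a trained InfoGAN generator mapping (z, c) inputs to images.
n = 9
z = np.repeat(np.random.uniform(-1, 1, size=(1, 62)), n, axis=0)  # one fixed z for all images
cat = np.tile(np.eye(10)[3], (n, 1))                              # fix the categorical code, e.g. "3"
cont = np.zeros((n, 2))
cont[:, 0] = np.linspace(-2, 2, n)                                # sweep the first continuous code

images = generator(np.concatenate([z, cat, cont], axis=1))
# Laying the n images out side by side reveals what the swept code controls (e.g. rotation).
```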
- 3D face images
  - InfoGAN learns to represent the faces via pose, elevation and lighting.
  - They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.)
- 3D chair images
  - InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments).
  - They used one categorical variable (four values) and one continuous variable (uniform `[-1, +1]`).
- SVHN
  - InfoGAN learns to represent the lighting and to spot the center digit.
  - They used four categorical variables (10 values each) and two continuous variables (uniform `[-1, +1]`). (Again, a few variables were apparently not associated with anything sensible?)
- CelebA
  - InfoGAN learns to represent azimuth (pose), presence/absence of glasses, hairstyle and emotion.
  - They used ten categorical variables (10 values each).