Task: retaining the expressivity of transformers for high-resolution image synthesis within a VQVAE-style framework.
Main novelty: integrating a transformer to learn long-range interactions over global image compositions, enabling a wide range of image-synthesis applications.
Approach: build the backbone with an autoencoder / discretize its embeddings into a codebook / synthesize by composing codebook entries autoregressively / condition in a fashion similar to the unconditional setup.
- Learn the autoencoder with a perceptual loss and a patch-based discriminator to keep synthesis quality high (see the sketch after this list).
- Discretize the embeddings into a codebook by element-wise (nearest-neighbour) quantization to reduce the search space of compositions; the discrete representation is a 16x16 grid of codebook indices (sketch below).
- Generate images by choosing an ordering of the indices and casting synthesis as autoregressive next-index prediction: given indices $s_{<i}$, the transformer learns the distribution over the next index, $p(s_i \mid s_{<i})$, so the likelihood of the full representation factorizes as $p(s) = \prod_i p(s_i \mid s_{<i})$. Training maximizes the log-likelihood of the representations, i.e. minimizes $\mathcal{L} = \mathbb{E}_{x \sim p(x)}[-\log p(s)]$ (see the sketch after this list).
- For conditioning, obtain a codebook for the conditioning information $c$ as well and simply prepend its index sequence $r$ to $s$. Restricting the negative log-likelihood to the entries of $s$ then yields the conditional likelihood of the sequence, $p(s \mid c) = \prod_i p(s_i \mid s_{<i}, c)$ (sketch below).
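A minimal sketch of how the first-stage objective could be assembled in PyTorch. `perceptual` (an LPIPS-style metric), `disc` (a patch-based discriminator returning per-patch logits), and the fixed weights `lam`/`beta` are illustrative assumptions; the paper actually computes the adversarial weight adaptively from gradient norms.

```python
import torch.nn.functional as F

def autoencoder_loss(x, x_rec, z, z_q, perceptual, disc, lam=0.8, beta=0.25):
    # perceptual (LPIPS-style) reconstruction term
    rec = perceptual(x, x_rec).mean()
    # VQ codebook + commitment terms, with a stop-gradient on each side
    vq = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())
    # generator side of the patch-based adversarial loss
    # (disc returns per-patch real/fake logits, averaged over patches)
    g_adv = -disc(x_rec).mean()
    return rec + vq + lam * g_adv
```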
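A sketch of the element-wise quantization step, assuming PyTorch; class and argument names are illustrative, not the authors' code. Each encoder output vector is replaced by its nearest codebook entry, with a straight-through estimator so gradients reach the encoder.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantization of encoder outputs against a codebook."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                    # z: (B, H, W, C), e.g. H = W = 16
        flat = z.reshape(-1, z.shape[-1])    # (B*H*W, C)
        # squared L2 distance from each vector to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        indices = d.argmin(dim=1)            # (B*H*W,)
        z_q = self.codebook(indices).view_as(z)
        # straight-through estimator: copy gradients through the quantizer
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1])   # quantized codes + index grid
```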
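The next-index objective reduces to cross-entropy over the flattened index grid, since cross-entropy on the shifted sequence equals the negative log-likelihood $-\log p(s)$ (up to the first token, which would need a start token to be modeled). `model` here stands for any causal transformer and is a placeholder, not an API from the paper.

```python
import torch.nn.functional as F

def ar_nll(model, s):
    # s: (B, L) codebook indices, flattened in the chosen ordering.
    # model maps (B, T) indices to (B, T, num_codes) logits;
    # the logit at position t predicts the token at position t + 1.
    logits = model(s[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*(L-1), num_codes)
        s[:, 1:].reshape(-1),                  # next-index targets
    )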
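Conditioning then only changes the input sequence, not the loss: prepend $r$ and score only the positions that predict entries of $s$. Again a sketch under the same assumptions as above.

```python
import torch
import torch.nn.functional as F

def conditional_nll(model, s, r):
    # r: (B, Lr) condition indices, s: (B, Ls) image indices
    seq = torch.cat([r, s], dim=1)
    logits = model(seq[:, :-1])        # causal: position t predicts token t + 1
    # restrict the loss to entries p(s_i | s_{<i}, r): keep only the
    # positions whose next token lies in s
    logits_s = logits[:, r.size(1) - 1:]
    return F.cross_entropy(
        logits_s.reshape(-1, logits_s.size(-1)),
        s.reshape(-1),
    )
```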
Applications: synthesis across different conditioning inputs (depth-to-image, semantic layout-to-image, image completion, pose-guided person generation, class-conditional synthesis, semantic segmentation masks).
Overall: the idea is simple and it arguably piggybacks on earlier work, but it has many potential applications and the experimental results improve substantially, making this a high-impact paper.