
Taming Transformers for High-Resolution Image Synthesis

Patrick Esser and Robin Rombach et al. (Heidelberg University) - CVPR 2021 (oral)

Summarized by Woohyeon Sim


Task: retaining the expressivity of transformers for high-resolution image synthesis in a VQVAE-style framework.

Main novelty: integrating a transformer to learn long-range interactions within global compositions for image synthesis / demonstrating a wide range of image-synthesis applications

Approach: building the backbone with an auto-encoder / discretizing the embeddings / synthesizing by composing embeddings auto-regressively / conditioning in a similar fashion to the unconditional setup.

  • Learn an auto-encoder with a perceptual loss and a patch-based discriminator to keep synthesis quality high.

  • Discretize the embeddings against a codebook by element-wise quantization to reduce the search space of compositions; the resulting discrete representation is a 16×16 grid of codebook indices (see the quantization sketch after this list).

  • Generate images by choosing an ordering of the indices and performing autoregressive next-index prediction: given indices $s_{<i}$, the transformer learns to predict the distribution over the next index, i.e. $p(s_i|s_{<i})$, which gives the likelihood of the full representation as $p(s) = \prod_i p(s_i|s_{<i})$. Training proceeds by maximizing the log-likelihood of $p(s)$ (see the training sketch after this list).

  • For conditioning, obtain again a codebook for the conditioning information and simply prepend its corresponding indices to $s$. The transformer is then trained on the negative log-likelihood of the entries $p(s_i|s_{<i}, c)$, which allows computing the likelihood of the sequence as $p(s|c) = \prod_i p(s_i|s_{<i}, c)$ (the conditional case is included in the sketches below).
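
The element-wise quantization step can be pictured as a nearest-neighbor lookup into the codebook. Below is a minimal PyTorch-style sketch; the tensor names (`z_e`, `codebook`) and the straight-through trick are illustrative assumptions, not the authors' exact code.

```python
# Minimal vector-quantization sketch (names and shapes are assumptions).
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Map each spatial embedding to its nearest codebook entry.

    z_e:      encoder output of shape (B, C, H, W), e.g. H = W = 16
    codebook: learnable embeddings of shape (K, C)
    Returns the quantized embeddings (B, C, H, W) and the index grid (B, H, W).
    """
    B, C, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)          # (B*H*W, C)
    # Squared L2 distance from every embedding to every codebook entry.
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ codebook.t()
             + codebook.pow(2).sum(1))                      # (B*H*W, K)
    indices = dists.argmin(dim=1)                           # nearest entries
    z_q = codebook[indices].reshape(B, H, W, C).permute(0, 3, 1, 2)
    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices.reshape(B, H, W)
```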
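
For the autoregressive stage, maximizing $\log p(s)$ reduces to next-index prediction with a cross-entropy (negative log-likelihood) loss, and conditioning amounts to prepending the condition indices to the sequence. The sketch below assumes a GPT-style decoder `gpt` that maps a token sequence to per-position logits over a shared vocabulary; all names are illustrative, not the authors' code.

```python
# Sketch of one training step of the autoregressive transformer.
import torch
import torch.nn.functional as F

def train_step(gpt, s_indices, c_indices=None):
    """Maximize log p(s) = sum_i log p(s_i | s_<i [, c]) over the index sequence.

    s_indices: flattened 16x16 grid of codebook indices, shape (B, 256)
    c_indices: optional condition indices (assumed offset into a shared vocabulary), (B, L_c)
    """
    if c_indices is not None:
        # Conditioning: simply prepend the condition tokens to the image tokens.
        tokens = torch.cat([c_indices, s_indices], dim=1)
        n_keep = s_indices.shape[1]          # score only the image-token positions
    else:
        tokens = s_indices
        n_keep = s_indices.shape[1] - 1      # s_1 has no prefix to predict it from
    logits = gpt(tokens[:, :-1])             # (B, T-1, vocab), teacher forcing
    targets = tokens[:, 1:]                  # next-index targets
    loss = F.cross_entropy(                  # negative log-likelihood
        logits[:, -n_keep:].reshape(-1, logits.size(-1)),
        targets[:, -n_keep:].reshape(-1),
    )
    return loss
```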
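
At test time, images are generated by sampling one index at a time and decoding the resulting 16×16 index grid with the learned decoder. A hedged sketch of the conditional case, with `gpt`, `decoder`, and `codebook` assumed as above:

```python
# Sketch of conditional sampling (names are assumptions, not the authors' API).
import torch

@torch.no_grad()
def sample(gpt, decoder, codebook, c_indices, n_img_tokens=256, grid=(16, 16)):
    """Sample image indices autoregressively, then decode them back to pixels."""
    seq = c_indices                               # (B, L_c): start from the condition
    for _ in range(n_img_tokens):
        logits = gpt(seq)[:, -1]                  # distribution over the next index
        # (in practice the logits would be restricted to image-codebook entries)
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)
    s = seq[:, -n_img_tokens:]                    # the sampled 16x16 index grid
    z_q = codebook[s].reshape(s.shape[0], *grid, -1).permute(0, 3, 1, 2)
    return decoder(z_q)                           # synthesized images
```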

Applications: synthesis across different conditioning inputs (depth-to-image, semantic layout-to-image, image completion, pose-guided person generation, class-conditional synthesis, semantic segmentation masks)

Overall: the idea is simple, and it does feel somewhat like it rides on top of previous work, but it has many possible applications and the experimental results improve dramatically, making this a high-impact paper.