\expandafter\ifx\csname doTocEntry\endcsname\relax \expandafter\endinput\fi
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-1}{}{\numberline {1}Introduction}}{2}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-5}{}{\numberline {1}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Muse\nobreakspace {}text-to-image generation ($512 \times 512$ resolution). Under each generated image, the corresponding caption is shown, exhibiting a variety of styles, captions and understanding. Each image was generated in $1.3$s on a TPUv4 chip. \relax }}}{5}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-7}{}{\numberline {2}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Examples of zero-shot text-guided image editing using Muse. We show examples of a number of editing applications using the Muse\nobreakspace {}text-to-image generative model, on \emph {real} input images, without fine-tuning. All edited images are generated at $512\times 512$ resolution. \relax }}}{8}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-12}{}{\numberline {3}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Muse\nobreakspace {}Framework: We show the training pipeline for our model, with the T5-XXL pre-trained text encoder, base model and super-resolution model depicted on the three rows. The text encoder generates a text embedding that is used for cross-attention with image tokens for both base and super-res Transformer layers. The base model uses a VQ Tokenizer that is pre-trained on lower resolution ($256\times 256$) images and generates a $16\times 16$ latent space of tokens. This sequence is masked at a variable rate per sample and then the cross-entropy loss learns to predict the masked image tokens. Once the base model is trained, the reconstructed lower-resolution tokens and text tokens are passed into the super-res model that then learns to predict masked tokens at a higher resolution. \relax }}}{12}\relax
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-2}{}{\numberline {2}Model}}{13}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-3}{}{\numberline {2.1}Pre-trained Text Encoders}}{13}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-4}{}{\numberline {2.2}Semantic Tokenization using VQGAN}}{13}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-5}{}{\numberline {2.3}Base Model}}{13}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-6}{}{\numberline {2.4}Super-Resolution Model}}{14}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-19}{}{\numberline {4}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Super-resolution Model. On the left is shown the architecture of the super-resolution model. Low-resolution tokens are passed into a series of self-attention Transformer layers; and the resulting output embeddings are concatenated with text embeddings extracted from the conditioning text prompt. Following this, cross-attention is applied from these concatenated embeddings to the masked high-resolution tokens; the loss learns to predict these masked tokens conditioned on the low-resolution and text tokens. On the right are shown two examples of the improvement brought about by the super-resolution model.\relax }}}{16}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-7}{}{\numberline {2.5}Decoder Finetuning}}{17}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-8}{}{\numberline {2.6}Variable Masking Rate}}{17}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-9}{}{\numberline {2.7}Classifier Free Guidance}}{17}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-10}{}{\numberline {2.8}Iterative Parallel Decoding at Inference}}{18}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-26}{}{\numberline {5}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Inference samples. We visualize the evolution of masked tokens over the sequence of steps for the base model (left) and the super-res model (right). The super-res model, being conditioned on the low-res tokens, requires significantly fewer sampling steps for convergence. \relax }}}{20}\relax
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-11}{}{\numberline {3}Results}}{21}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-29}{}{\numberline {6}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Examples demonstrating text-to-image capabilities of Muse\nobreakspace {}for various text properties. Top left: cardinality; top right: composition; middle left: style; middle right: text rendering; and bottom left: usage of the entire prompt. For all examples, $16$ instances per prompt were generated, and the one with the highest CLIP score \citep {clip} was chosen. Bottom right: examples of generated image failure in Muse\nobreakspace {}for various text properties such as direct rendering of long phrases, high cardinalities, and multiple cardinalities.\relax }}}{23}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-31}{}{\numberline {7}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Comparing the same prompts across DALL-E2 \citep {dalle2} (left), Imagen \citep {imagen} (middle) and Muse\nobreakspace {}(right). \relax }}}{26}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-12}{}{\numberline {3.1}Qualitative Performance}}{27}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-13}{}{\numberline {3.2}Quantitative Performance}}{27}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-35}{}{\numberline {1}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Quantitative evaluation on CC3M \citep {sharma2018conceptual}; all models are trained and evaluated on CC3M.\relax }}}{29}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-37}{}{\numberline {2}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Quantitative evaluation of FID and CLIP score (where available) on MS-COCO \citep {coco} for $256\times 256$ image resolution. Muse\nobreakspace {} achieves a CLIP score of 0.32, higher than the score of 0.27 reported in Imagen. Other papers in the table above did not report a CLIP score.\relax }}}{32}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-39}{}{\numberline {8}{\ignorespaces CLIP vs. FID tradeoff curve. We perform sweeps of sampling parameters for a fixed model, then plot the Pareto front.\relax }}}{35}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-41}{}{\numberline {9}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Percentage of prompts for which a human rater consensus chose a model alignment preference. Contributions from specific numbers of rater consensuses are shown in different colors, while marginals over consensuses ($=5$, $\geq 4$, and $\geq 3$) are shown numerically.\relax }}}{35}\relax
\doTocEntry\tocsubsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-14}{}{\numberline {3.2.1}Human evaluation}}{36}\relax
\doTocEntry\tocsubsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-15}{}{\numberline {3.2.2}Inference Speed}}{37}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-16}{}{\numberline {3.3}Image Editing}}{37}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-45}{}{\numberline {3}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Per-batch inference time for several models. Muse, Imagen, and Parti were benchmarked internally on TPUv4 hardware. Stable Diffusion/LDM benchmark from \citep {sdinference}, on A100 GPUs. The ``LDM (250 steps)'' time comes from scaling the 50-step time by 5; 250 steps were used to achieve the FID in \cref {tab:eval_coco}.\relax }}}{37}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-49}{}{\numberline {10}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Examples of text-guided inpainting. The mask is shown in the second column of each row. This behavior arises directly from the model with no fine-tuning.\relax }}}{39}\relax
\doTocEntry\tocsubsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-17}{}{\numberline {3.3.1}Text-guided Inpainting / outpainting}}{40}\relax
\doTocEntry\tocsubsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-18}{}{\numberline {3.3.2}Zero-shot Mask-free editing}}{40}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-53}{}{\numberline {11}{\ignorespaces Examples of zero-shot mask-free image editing, post super-resolution. We see that the pose and overall structure of the image are maintained while specific aspects of the object are changed based on the text prompt.\relax }}}{42}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-55}{}{\numberline {12}{\ignorespaces Intermediate iterations producing one of the edits in \cref {fig:mfe_gallery} (pre-superres)\relax }}}{45}\relax
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-19}{}{\numberline {4}Related Work}}{46}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-20}{}{\numberline {4.1}Image Generation Models}}{46}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-21}{}{\numberline {4.2}Image Tokenizers}}{46}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-22}{}{\numberline {4.3}Large Language Models}}{46}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-23}{}{\numberline {4.4}Text-Image Models}}{47}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-24}{}{\numberline {4.5}Image Editing with Generative Models}}{47}\relax
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-25}{}{\numberline {5}Discussion and Social Impact}}{47}\relax
\doTocEntry\toclikesection{}{\csname a:TocLink\endcsname{1}{x1-10005}{QQ2-1-26}{Acknowledgements}}{48}\relax
\doTocEntry\toclikesection{}{\csname a:TocLink\endcsname{1}{x1-20005}{QQ2-1-27}{References\markboth {\MakeUppercase {References}}{\MakeUppercase {References}}}}{48}\relax
\doTocEntry\tocsection{}{\csname a:TocLink\endcsname{1}{Q1-1-28}{}{\numberline {A}Appendix.}}{57}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-29}{}{\numberline {A.1}Base Model Configurations}}{57}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-2004}{}{\numberline {4}{\ignorespaces Configuration and training hyperparameters for base model.\relax }}}{59}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-30}{}{\numberline {A.2}VQGAN Configurations}}{60}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-2007}{}{\numberline {5}{\ignorespaces Configuration and training hyperparameters for VQGAN.\relax }}}{62}\relax
\doTocEntry\toclof{}{\csname a:TocLink\endcsname{1}{x1-2009}{}{\numberline {13}{\ignorespaces \relax \fontsize {9}{10pt}\selectfont Visual example of the improvement from the fine-tuned decoder (\cref {sec:dec_finetune}). Please zoom in by at least 200% to see the difference between the VQGAN reconstruction and the reconstruction with a finetuned decoder. We can see especially that fine details such as the house number (bottom left), the storefront sign (middle) and the bars on the windows (right) are better preserved in the finetuned decoder.\relax }}}{65}\relax
\doTocEntry\tocsubsection{}{\csname a:TocLink\endcsname{1}{Q1-1-31}{}{\numberline {A.3}Super Resolution Configurations}}{66}\relax
\doTocEntry\toclot{}{\csname a:TocLink\endcsname{1}{x1-2012}{}{\numberline {6}{\ignorespaces Configuration and training hyperparameters for the Super-Resolution Model.\relax }}}{68}\relax
\par