A collection of papers on large models.
-
DALL-E-2: Hierarchical Text-Conditional Image Generation with CLIP Latents
DALL-E-2 is a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. paper
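A minimal sketch of this two-stage pipeline. The `Prior` and `Decoder` modules below are hypothetical stand-ins (the real unCLIP prior and decoder are diffusion models); only the data flow described above is illustrated.

```python
import torch
import torch.nn as nn

class Prior(nn.Module):
    """Stand-in for the prior: maps a CLIP text embedding to a CLIP image embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb):
        return self.net(text_emb)

class Decoder(nn.Module):
    """Stand-in for the decoder: generates an image conditioned on a CLIP image embedding."""
    def __init__(self, dim=512, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(dim, 3 * image_size * image_size)

    def forward(self, image_emb):
        x = self.net(image_emb)
        return x.view(-1, 3, self.image_size, self.image_size)

# Two-stage generation: caption -> CLIP text embedding -> prior -> image embedding -> decoder -> image.
text_emb = torch.randn(1, 512)        # placeholder for the CLIP text embedding of a caption
image_emb = Prior()(text_emb)         # stage 1: prior
image = Decoder()(image_emb)          # stage 2: decoder
print(image.shape)                    # torch.Size([1, 3, 64, 64])
```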
-
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
The latent diffusion model (the base version of Stable Diffusion) uses a diffusion model to model the latent space of images, and introduces cross-attention layers into the model architecture to enable conditional generation. paper
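A minimal PyTorch sketch of the cross-attention conditioning: queries come from the flattened image latents inside the denoising network, keys and values from the text embeddings. Dimensions (320-dim latents, 77 text tokens) and module names are illustrative assumptions, not the actual Stable Diffusion U-Net block.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=latent_dim, num_heads=heads,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_emb):
        # latents: (B, H*W, latent_dim) flattened latent feature map
        # text_emb: (B, T, text_dim) token embeddings from the text encoder
        attended, _ = self.attn(query=self.norm(latents), key=text_emb, value=text_emb)
        return latents + attended  # residual connection

block = CrossAttentionBlock()
latents = torch.randn(2, 64 * 64, 320)   # a 64x64 latent grid
text_emb = torch.randn(2, 77, 768)       # e.g. 77 CLIP-style text tokens
print(block(latents, text_emb).shape)    # torch.Size([2, 4096, 320])
```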
-
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Imagen builds on the power of large transformer language models (e.g. T5) in understanding text and uses cascaded diffusion models to generate high-fidelity images. paper
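An illustrative sketch of the cascade: a base text-to-image diffusion model at low resolution followed by super-resolution diffusion models (64 -> 256 -> 1024 in Imagen), all conditioned on the same frozen text embeddings. The `sample_*` functions below are hypothetical placeholders standing in for full diffusion sampling loops.

```python
import torch
import torch.nn as nn

def sample_base(text_emb, size=64):
    # placeholder for sampling the base text-to-image diffusion model
    return torch.randn(text_emb.shape[0], 3, size, size)

def sample_super_res(low_res, text_emb, size):
    # placeholder for a text-conditioned super-resolution diffusion model
    up = nn.functional.interpolate(low_res, size=(size, size), mode="bilinear", align_corners=False)
    return up + 0.1 * torch.randn_like(up)

text_emb = torch.randn(1, 77, 1024)        # stand-in for frozen T5 encoder outputs
img_64 = sample_base(text_emb, 64)
img_256 = sample_super_res(img_64, text_emb, 256)
img_1024 = sample_super_res(img_256, text_emb, 1024)
print(img_1024.shape)                      # torch.Size([1, 3, 1024, 1024])
```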
-
DALL-E: Zero-Shot Text-to-Image Generation
DALL-E uses a transformer to autoregressively model text and image tokens as a single stream of data. paper
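A rough sketch of the single-stream objective: text tokens and discrete image tokens are concatenated into one sequence and a decoder-only transformer is trained with a next-token loss. Vocabulary and model sizes here are made up; the real DALL-E uses a dVAE image tokenizer and a much larger sparse transformer.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 256))                  # BPE text tokens
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 1024)) + TEXT_VOCAB  # offset into a joint vocabulary
stream = torch.cat([text_tokens, image_tokens], dim=1)                # one token stream

embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
head = nn.Linear(512, TEXT_VOCAB + IMAGE_VOCAB)

x = embed(stream[:, :-1])
mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])     # causal mask
logits = head(layer(x, src_mask=mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                   stream[:, 1:].reshape(-1))
print(loss.item())
```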
-
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. paper
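A sketch of the sequence-to-sequence view: a transformer encoder reads the text tokens and a decoder autoregressively predicts discrete image tokens (produced by a ViT-VQGAN tokenizer in Parti). Sizes are illustrative assumptions, not the actual Parti configuration.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
text_embed = nn.Embedding(TEXT_VOCAB, 512)
image_embed = nn.Embedding(IMAGE_VOCAB, 512)
head = nn.Linear(512, IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (2, 64))        # source: text tokens
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 256))     # target: image tokens

tgt_in = image_embed(image_tokens[:, :-1])
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.shape[1])
out = model(src=text_embed(text_tokens), tgt=tgt_in, tgt_mask=tgt_mask)
loss = nn.functional.cross_entropy(head(out).reshape(-1, IMAGE_VOCAB),
                                   image_tokens[:, 1:].reshape(-1))
print(loss.item())
```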
-
Muse: Text-To-Image Generation via Masked Generative Transformers
Given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared with pixel-space diffusion or autoregressive models, Muse is more efficient thanks to discrete tokens and parallel decoding. paper
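A sketch of the masked image-token objective: a fraction of the discrete image tokens is replaced with a [MASK] id and a bidirectional transformer predicts the originals. Text conditioning (cross-attention to the frozen LLM embeddings in Muse) is omitted here for brevity; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

IMAGE_VOCAB, MASK_ID = 8192, 8192            # reserve one extra id for [MASK]
embed = nn.Embedding(IMAGE_VOCAB + 1, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
head = nn.Linear(512, IMAGE_VOCAB)

image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 256))
mask = torch.rand(image_tokens.shape) < 0.5          # mask roughly half of the tokens
inputs = image_tokens.masked_fill(mask, MASK_ID)

logits = head(layer(embed(inputs)))                  # bidirectional: no causal mask
loss = nn.functional.cross_entropy(logits[mask], image_tokens[mask])
print(loss.item())
```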
-
UniDiffuser: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
UniDiffuser claims that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. paper
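A sketch of the unified objective: perturb the image latent and the text embedding with independent timesteps, and have one network predict both noises; setting one timestep to 0 (clean) recovers conditional generation. `JointNoisePredictor` is a hypothetical stand-in, not the actual UniDiffuser transformer.

```python
import torch
import torch.nn as nn

class JointNoisePredictor(nn.Module):
    def __init__(self, img_dim=64, txt_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + txt_dim + 2, 256), nn.GELU(),
                                 nn.Linear(256, img_dim + txt_dim))
        self.img_dim = img_dim

    def forward(self, x_img, x_txt, t_img, t_txt):
        h = torch.cat([x_img, x_txt, t_img[:, None], t_txt[:, None]], dim=-1)
        out = self.net(h)
        return out[:, :self.img_dim], out[:, self.img_dim:]   # predicted image / text noise

model = JointNoisePredictor()
x_img, x_txt = torch.randn(4, 64), torch.randn(4, 64)
noise_img, noise_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
t_img, t_txt = torch.rand(4), torch.rand(4)                   # independent perturbation levels
# simple linear-in-t perturbation, just for illustration
pred_img, pred_txt = model(x_img + t_img[:, None] * noise_img,
                           x_txt + t_txt[:, None] * noise_txt, t_img, t_txt)
loss = ((pred_img - noise_img) ** 2).mean() + ((pred_txt - noise_txt) ** 2).mean()
print(loss.item())
```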
-
CLIP: Learning Transferable Visual Models From Natural Language Supervision
The pre-training task of CLIP is predicting which caption goes with which image through a contrastive learning loss. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. paper
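A minimal sketch of the contrastive objective: embed a batch of images and their captions, normalize, and apply a symmetric cross-entropy loss so that matching image-text pairs have the highest similarity. The encoders are replaced with random embeddings here, and the temperature is fixed for simplicity (CLIP learns it).

```python
import torch
import torch.nn.functional as F

batch = 8
image_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # placeholder image encoder output
text_emb = F.normalize(torch.randn(batch, 512), dim=-1)    # placeholder text encoder output
temperature = 0.07

logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
labels = torch.arange(batch)                               # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```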
-
BEiT 3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
BEiT 3 introduces Multiway Transformers for general-purpose modeling and uses masked "language" modeling on images, texts, and image-text pairs. paper
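A sketch of a Multiway Transformer block: self-attention is shared across modalities, while each token is routed to a modality-specific feed-forward "expert". BEiT 3 uses vision, language, and vision-language experts; this simplified version keeps only two, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared across modalities
        self.ffn_vision = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_language = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, is_vision):
        # x: (B, T, dim); is_vision: (B, T) boolean mask marking image tokens
        h = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        y = self.norm2(h)
        # for clarity both experts run on all tokens and the output is selected per token
        out = torch.where(is_vision[..., None], self.ffn_vision(y), self.ffn_language(y))
        return h + out

block = MultiwayBlock()
tokens = torch.randn(2, 96, 512)                        # e.g. 64 image patches + 32 text tokens
is_vision = torch.arange(96)[None, :].expand(2, 96) < 64
print(block(tokens, is_vision).shape)                   # torch.Size([2, 96, 512])
```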
-
GPT-3: Language Models are Few-Shot Learners
GPT-3 is trained to predict the next word in a sentence. However, model developers and early users demonstrated that it has surprising capabilities, like the ability to write convincing essays, create charts and websites from text descriptions, generate computer code, and more, all with little to no supervision. paper
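A minimal sketch of that training objective: a decoder-only transformer predicts each token from the preceding ones with a cross-entropy loss. Vocabulary and model sizes here are illustrative, far smaller than GPT-3.

```python
import torch
import torch.nn as nn

VOCAB = 50257
embed = nn.Embedding(VOCAB, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
head = nn.Linear(256, VOCAB)

tokens = torch.randint(0, VOCAB, (2, 128))
causal_mask = nn.Transformer.generate_square_subsequent_mask(127)     # each position sees only the past
logits = head(layer(embed(tokens[:, :-1]), src_mask=causal_mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
```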