A collection of papers on large models.
-
DALL-E-2: Hierarchical Text-Conditional Image Generation with CLIP Latents
DALL-E-2 is a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. paper
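A minimal sketch of this two-stage pipeline. The `Prior` and `Decoder` modules below are hypothetical stand-ins (the real unCLIP prior and decoder are diffusion models); only the data flow described above is illustrated.

```python
import torch
import torch.nn as nn

class Prior(nn.Module):
    """Stand-in for the prior: maps a CLIP text embedding to a CLIP image embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb):
        return self.net(text_emb)

class Decoder(nn.Module):
    """Stand-in for the decoder: generates an image conditioned on a CLIP image embedding."""
    def __init__(self, dim=512, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(dim, 3 * image_size * image_size)

    def forward(self, image_emb):
        x = self.net(image_emb)
        return x.view(-1, 3, self.image_size, self.image_size)

# Two-stage generation: caption -> CLIP text embedding -> prior -> image embedding -> decoder -> image.
text_emb = torch.randn(1, 512)        # placeholder for the CLIP text embedding of a caption
image_emb = Prior()(text_emb)         # stage 1: prior
image = Decoder()(image_emb)          # stage 2: decoder
print(image.shape)                    # torch.Size([1, 3, 64, 64])
```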
-
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
The latent diffusion model (the base version of Stable Diffusion) uses a diffusion model to model the latent space of images, and introduces cross-attention layers into the model architecture to enable conditional generation. paper
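A minimal PyTorch sketch of the cross-attention conditioning: queries come from the flattened image latents inside the denoising network, keys and values from the text embeddings. Dimensions (320-dim latents, 77 text tokens) and module names are illustrative assumptions, not the actual Stable Diffusion U-Net block.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=latent_dim, num_heads=heads,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_emb):
        # latents: (B, H*W, latent_dim) flattened latent feature map
        # text_emb: (B, T, text_dim) token embeddings from the text encoder
        attended, _ = self.attn(query=self.norm(latents), key=text_emb, value=text_emb)
        return latents + attended  # residual connection

block = CrossAttentionBlock()
latents = torch.randn(2, 64 * 64, 320)   # a 64x64 latent grid
text_emb = torch.randn(2, 77, 768)       # e.g. 77 CLIP-style text tokens
print(block(latents, text_emb).shape)    # torch.Size([2, 4096, 320])
```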
-
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Imagen builds on the power of large transformer language models (e.g. T5) in understanding text and uses cascaded diffusion models to generate high-fidelity images. paper
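An illustrative sketch of the cascade: a base text-to-image diffusion model at low resolution followed by super-resolution diffusion models (64 -> 256 -> 1024 in Imagen), all conditioned on the same frozen text embeddings. The `sample_*` functions below are hypothetical placeholders standing in for full diffusion sampling loops.

```python
import torch
import torch.nn as nn

def sample_base(text_emb, size=64):
    # placeholder for sampling the base text-to-image diffusion model
    return torch.randn(text_emb.shape[0], 3, size, size)

def sample_super_res(low_res, text_emb, size):
    # placeholder for a text-conditioned super-resolution diffusion model
    up = nn.functional.interpolate(low_res, size=(size, size), mode="bilinear", align_corners=False)
    return up + 0.1 * torch.randn_like(up)

text_emb = torch.randn(1, 77, 1024)        # stand-in for frozen T5 encoder outputs
img_64 = sample_base(text_emb, 64)
img_256 = sample_super_res(img_64, text_emb, 256)
img_1024 = sample_super_res(img_256, text_emb, 1024)
print(img_1024.shape)                      # torch.Size([1, 3, 1024, 1024])
```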
-
DALL-E: Zero-Shot Text-to-Image Generation
DALL-E uses a transformer to autoregressively model text and image tokens as a single stream of data. paper
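A rough sketch of the single-stream objective: text tokens and discrete image tokens are concatenated into one sequence and a decoder-only transformer is trained with a next-token loss. Vocabulary and model sizes here are made up; the real DALL-E uses a dVAE image tokenizer and a much larger sparse transformer.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 256))                  # BPE text tokens
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 1024)) + TEXT_VOCAB  # offset into a joint vocabulary
stream = torch.cat([text_tokens, image_tokens], dim=1)                # one token stream

embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
head = nn.Linear(512, TEXT_VOCAB + IMAGE_VOCAB)

x = embed(stream[:, :-1])
mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])     # causal mask
logits = head(layer(x, src_mask=mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                   stream[:, 1:].reshape(-1))
print(loss.item())
```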
-
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. paper
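A sketch of the sequence-to-sequence view: a transformer encoder reads the text tokens and a decoder autoregressively predicts discrete image tokens (produced by a ViT-VQGAN tokenizer in Parti). Sizes are illustrative assumptions, not the actual Parti configuration.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
text_embed = nn.Embedding(TEXT_VOCAB, 512)
image_embed = nn.Embedding(IMAGE_VOCAB, 512)
head = nn.Linear(512, IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (2, 64))        # source: text tokens
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 256))     # target: image tokens

tgt_in = image_embed(image_tokens[:, :-1])
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.shape[1])
out = model(src=text_embed(text_tokens), tgt=tgt_in, tgt_mask=tgt_mask)
loss = nn.functional.cross_entropy(head(out).reshape(-1, IMAGE_VOCAB),
                                   image_tokens[:, 1:].reshape(-1))
print(loss.item())
```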
-
Muse: Text-To-Image Generation via Masked Generative Transformers
Given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared with pixel-space diffusion or autoregressive models, Muse is more efficient thanks to discrete tokens and parallel decoding. paper
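A sketch of the masked image-token objective: a fraction of the discrete image tokens is replaced with a [MASK] id and a bidirectional transformer predicts the originals. Text conditioning (cross-attention to the frozen LLM embeddings in Muse) is omitted here for brevity; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

IMAGE_VOCAB, MASK_ID = 8192, 8192            # reserve one extra id for [MASK]
embed = nn.Embedding(IMAGE_VOCAB + 1, 512)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
head = nn.Linear(512, IMAGE_VOCAB)

image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 256))
mask = torch.rand(image_tokens.shape) < 0.5          # mask roughly half of the tokens
inputs = image_tokens.masked_fill(mask, MASK_ID)

logits = head(layer(embed(inputs)))                  # bidirectional: no causal mask
loss = nn.functional.cross_entropy(logits[mask], image_tokens[mask])
print(loss.item())
```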
-
UniDiffuser: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
UniDiffuser claims that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. paper
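A sketch of the unified objective: perturb the image latent and the text embedding with independent timesteps, and have one network predict both noises; setting one timestep to 0 (clean) recovers conditional generation. `JointNoisePredictor` is a hypothetical stand-in, not the actual UniDiffuser transformer.

```python
import torch
import torch.nn as nn

class JointNoisePredictor(nn.Module):
    def __init__(self, img_dim=64, txt_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + txt_dim + 2, 256), nn.GELU(),
                                 nn.Linear(256, img_dim + txt_dim))
        self.img_dim = img_dim

    def forward(self, x_img, x_txt, t_img, t_txt):
        h = torch.cat([x_img, x_txt, t_img[:, None], t_txt[:, None]], dim=-1)
        out = self.net(h)
        return out[:, :self.img_dim], out[:, self.img_dim:]   # predicted image / text noise

model = JointNoisePredictor()
x_img, x_txt = torch.randn(4, 64), torch.randn(4, 64)
noise_img, noise_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
t_img, t_txt = torch.rand(4), torch.rand(4)                   # independent perturbation levels
# simple linear-in-t perturbation, just for illustration
pred_img, pred_txt = model(x_img + t_img[:, None] * noise_img,
                           x_txt + t_txt[:, None] * noise_txt, t_img, t_txt)
loss = ((pred_img - noise_img) ** 2).mean() + ((pred_txt - noise_txt) ** 2).mean()
print(loss.item())
```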
-
CLIP: Learning Transferable Visual Models From Natural Language Supervision
The pre-training task of CLIP is predicting which caption goes with which image through a contrastive learning loss. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. paper
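A minimal sketch of the contrastive objective: embed a batch of images and their captions, normalize, and apply a symmetric cross-entropy loss so that matching image-text pairs have the highest similarity. The encoders are replaced with random embeddings here, and the temperature is fixed for simplicity (CLIP learns it).

```python
import torch
import torch.nn.functional as F

batch = 8
image_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # placeholder image encoder output
text_emb = F.normalize(torch.randn(batch, 512), dim=-1)    # placeholder text encoder output
temperature = 0.07

logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
labels = torch.arange(batch)                               # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```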
-
BEiT 3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
BEiT 3 introduces Multiway Transformers for general-purpose modeling and uses masked "language" modeling on images, texts, and image-text pairs. paper
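A sketch of a Multiway Transformer block: self-attention is shared across modalities, while each token is routed to a modality-specific feed-forward "expert". BEiT 3 uses vision, language, and vision-language experts; this simplified version keeps only two, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared across modalities
        self.ffn_vision = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_language = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, is_vision):
        # x: (B, T, dim); is_vision: (B, T) boolean mask marking image tokens
        h = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        y = self.norm2(h)
        # for clarity both experts run on all tokens and the output is selected per token
        out = torch.where(is_vision[..., None], self.ffn_vision(y), self.ffn_language(y))
        return h + out

block = MultiwayBlock()
tokens = torch.randn(2, 96, 512)                        # e.g. 64 image patches + 32 text tokens
is_vision = torch.arange(96)[None, :].expand(2, 96) < 64
print(block(tokens, is_vision).shape)                   # torch.Size([2, 96, 512])
```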
-
GPT-3: Language Models are Few-Shot Learners
GPT-3 is trained to predict the next word in a sentence. However, model developers and early users demonstrated that it has surprising capabilities, like the ability to write convincing essays, create charts and websites from text descriptions, generate computer code, and more, all with little to no supervision. paper
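A minimal sketch of that training objective: a decoder-only transformer predicts each token from the preceding ones with a cross-entropy loss. Vocabulary and model sizes here are illustrative, far smaller than GPT-3.

```python
import torch
import torch.nn as nn

VOCAB = 50257
embed = nn.Embedding(VOCAB, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
head = nn.Linear(256, VOCAB)

tokens = torch.randint(0, VOCAB, (2, 128))
causal_mask = nn.Transformer.generate_square_subsequent_mask(127)     # each position sees only the past
logits = head(layer(embed(tokens[:, :-1]), src_mask=causal_mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
```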