BigModelPapers

A collection of large model papers.

Paper list

Text-to-image models

Diffusion models

  • DALL-E-2: Hierarchical Text-Conditional Image Generation with CLIP Latents

    DALL-E-2 is a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. paper
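
    A minimal sketch of the two-stage pipeline, with toy stand-in networks in place of the real diffusion prior and decoder (all module definitions and dimensions here are illustrative assumptions, not the paper's code):

      import torch
      import torch.nn as nn

      EMBED_DIM = 512  # width of the shared CLIP embedding space; the value is illustrative

      class Prior(nn.Module):
          # Stand-in for the diffusion prior: CLIP text embedding -> CLIP image embedding.
          def __init__(self):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.GELU(),
                                       nn.Linear(EMBED_DIM, EMBED_DIM))

          def forward(self, text_emb):
              return self.net(text_emb)

      class Decoder(nn.Module):
          # Stand-in for the diffusion decoder: CLIP image embedding -> pixels.
          def __init__(self):
              super().__init__()
              self.net = nn.Linear(EMBED_DIM, 3 * 64 * 64)

          def forward(self, img_emb):
              return self.net(img_emb).view(-1, 3, 64, 64)

      text_emb = torch.randn(1, EMBED_DIM)  # placeholder for a real CLIP text embedding
      img_emb = Prior()(text_emb)           # stage 1: text embedding -> image embedding
      image = Decoder()(img_emb)            # stage 2: image embedding -> (1, 3, 64, 64) image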

  • Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models

    The latent diffusion model (the base version of Stable Diffusion) runs the diffusion process in the latent space of an image autoencoder, and introduces cross-attention layers into the model architecture to enable conditional generation. paper
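
    A rough sketch of the cross-attention conditioning idea, where flattened image latents attend to text-encoder tokens (the shapes and the text projection are illustrative assumptions):

      import torch
      import torch.nn as nn

      class CrossAttentionBlock(nn.Module):
          # Queries come from image latents; keys/values come from text tokens.
          def __init__(self, latent_dim=320, text_dim=768, heads=8):
              super().__init__()
              self.to_text = nn.Linear(text_dim, latent_dim)  # project text into latent width
              self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

          def forward(self, latents, text_tokens):
              ctx = self.to_text(text_tokens)
              out, _ = self.attn(query=latents, key=ctx, value=ctx)
              return latents + out  # residual connection

      latents = torch.randn(1, 64 * 64, 320)  # flattened spatial latent (e.g. a 64x64 grid)
      text = torch.randn(1, 77, 768)          # e.g. a CLIP text-encoder token sequence
      conditioned = CrossAttentionBlock()(latents, text)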

  • Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Imagen builds on the power of large transformer language models (e.g. T5) in understanding text and uses cascaded diffusion models to generate high-fidelity images. paper
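
    A toy sketch of the cascade: a low-resolution base sample passes through successive super-resolution stages, each conditioned on frozen text features. The Stage module and its crude additive conditioning are illustrative stand-ins; the real stages are diffusion models:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class Stage(nn.Module):
          # Stand-in for one text-conditioned diffusion model in the cascade.
          def __init__(self, out_size, text_dim=1024):
              super().__init__()
              self.out_size = out_size
              self.cond = nn.Linear(text_dim, 3)  # crude additive text conditioning
              self.refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)

          def forward(self, image, text_emb):
              x = F.interpolate(image, size=self.out_size)  # upsample the previous output
              bias = self.cond(text_emb.mean(dim=1)).view(-1, 3, 1, 1)
              return self.refine(x + bias)

      text_emb = torch.randn(1, 77, 1024)      # placeholder for frozen T5 encoder features
      image = torch.randn(1, 3, 64, 64)        # the base model generates at 64x64
      for stage in (Stage(256), Stage(1024)):  # 64 -> 256 -> 1024 cascade
          image = stage(image, text_emb)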

Autoregressive models

  • DALL-E: Zero-Shot Text-to-Image Generation

    DALL-E uses a transformer to autoregressively model text and image tokens as a single stream of data. paper
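
    A small sketch of the single-stream idea: image tokens are offset into a shared vocabulary, concatenated after the text tokens, and the combined sequence is trained with ordinary next-token prediction (the lengths and vocabulary sizes follow the paper; any decoder-only transformer would consume the inputs/targets below):

      import torch

      # 256 BPE text tokens (16384-entry vocab) plus 32x32 = 1024 image tokens
      # (8192-entry dVAE codebook), offset so the two vocabularies don't collide.
      TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
      text_tokens = torch.randint(0, TEXT_VOCAB, (1, 256))
      image_tokens = torch.randint(0, IMAGE_VOCAB, (1, 1024)) + TEXT_VOCAB
      stream = torch.cat([text_tokens, image_tokens], dim=1)  # (1, 1280) single stream

      # A GPT-style transformer is then trained with next-token prediction
      # over the combined stream:
      inputs, targets = stream[:, :-1], stream[:, 1:]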

  • Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. paper
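
    A minimal sketch of the sequence-to-sequence view, using a generic PyTorch encoder-decoder transformer as a stand-in for Parti's architecture (all sizes are illustrative):

      import torch
      import torch.nn as nn

      # Encoder reads the text; decoder autoregressively emits image tokens
      # (produced by a ViT-VQGAN tokenizer in the paper).
      model = nn.Transformer(d_model=512, batch_first=True)
      text_emb = torch.randn(1, 64, 512)     # embedded text prompt (encoder input)
      image_emb = torch.randn(1, 1024, 512)  # embedded image tokens so far (decoder input)
      causal_mask = model.generate_square_subsequent_mask(1024)
      out = model(src=text_emb, tgt=image_emb, tgt_mask=causal_mask)  # (1, 1024, 512)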

Others

  • Muse: Text-To-Image Generation via Masked Generative Transformers

    Given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Thanks to discrete tokens and parallel decoding, Muse is more efficient at inference than comparable diffusion or autoregressive models. paper
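
    A sketch of one training step, assuming a VQ tokenizer has already produced discrete image tokens; the additive text conditioning below is a simplification (the paper conditions via cross-attention), and all sizes are illustrative:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      VOCAB, MASK_ID, DIM = 8192, 8192, 512
      tokens = torch.randint(0, VOCAB, (1, 256))  # image tokens from a VQ tokenizer
      mask = torch.rand(1, 256) < 0.5             # random masking ratio
      inputs = tokens.masked_fill(mask, MASK_ID)

      embed = nn.Embedding(VOCAB + 1, DIM)        # +1 for the [MASK] token
      encoder = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
      head = nn.Linear(DIM, VOCAB)

      text_emb = torch.randn(1, 1, DIM)           # placeholder for LLM text features
      logits = head(encoder(embed(inputs) + text_emb))    # bidirectional: no causal mask
      loss = F.cross_entropy(logits[mask], tokens[mask])  # loss only on masked positions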

Unified generative models

  • UniDiffuser: One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

    UniDiffuser claims that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can differ across modalities. By setting proper timesteps, UniDiffuser can perform image, text, text-to-image, image-to-text, and image-text pair generation without additional overhead. paper
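
    A sketch of one training step of the unified objective, with a plain linear layer standing in for the paper's transformer backbone (the noise schedule and dimensions are illustrative assumptions):

      import torch
      import torch.nn as nn

      # Perturb each modality with its own independently sampled timestep,
      # then predict both noises jointly.
      T = 1000
      img_emb, txt_emb = torch.randn(1, 512), torch.randn(1, 512)
      t_img, t_txt = torch.randint(0, T, (1,)), torch.randint(0, T, (1,))
      eps_img, eps_txt = torch.randn_like(img_emb), torch.randn_like(txt_emb)

      alpha = lambda t: torch.cos(0.5 * torch.pi * t / T).view(-1, 1)  # toy schedule
      x_img = alpha(t_img) * img_emb + (1 - alpha(t_img) ** 2).sqrt() * eps_img
      x_txt = alpha(t_txt) * txt_emb + (1 - alpha(t_txt) ** 2).sqrt() * eps_txt

      joint_net = nn.Linear(512 * 2 + 2, 512 * 2)  # stand-in for the joint noise predictor
      pred = joint_net(torch.cat([x_img, x_txt,
                                  t_img[:, None] / T, t_txt[:, None] / T], dim=1))
      loss = ((pred - torch.cat([eps_img, eps_txt], dim=1)) ** 2).mean()

      # At sampling time: fixing t_txt = 0 (clean text as the condition) gives
      # text-to-image; fixing t_img = 0 gives image-to-text; tying t_img = t_txt
      # gives joint image-text pair generation.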

Vision-Language Pre-training models

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision

    CLIP's pre-training task is to predict which caption goes with which image via a contrastive learning loss. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. paper
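
    The symmetric contrastive objective, sketched with random stand-in embeddings in place of the encoder outputs (the temperature value mirrors the paper's initialization; batch and embedding sizes are illustrative):

      import torch
      import torch.nn.functional as F

      # Matched image/caption pairs sit on the diagonal of the similarity matrix.
      img = F.normalize(torch.randn(8, 512), dim=-1)  # image embeddings
      txt = F.normalize(torch.randn(8, 512), dim=-1)  # text embeddings
      logits = img @ txt.t() / 0.07                   # temperature-scaled cosine similarities
      labels = torch.arange(8)                        # i-th image matches i-th caption
      loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2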

  • BEiT 3: Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks

    BEiT 3 introduces Multiway Transformers as a general-purpose architecture and uses masked "language" modeling on images, texts, and image-text pairs. paper
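
    A sketch of the Multiway idea: one block shares self-attention across modalities but routes tokens through modality-specific feed-forward experts (the dimensions and the two-expert setup are a simplification of the paper's design):

      import torch
      import torch.nn as nn

      class MultiwayBlock(nn.Module):
          # Self-attention is shared; each modality has its own FFN expert.
          def __init__(self, dim=768, heads=12):
              super().__init__()
              self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                          nn.Linear(4 * dim, dim))
              self.experts = nn.ModuleDict({"vision": ffn(), "language": ffn()})

          def forward(self, x, modality):
              x = x + self.attn(x, x, x)[0]         # shared attention
              return x + self.experts[modality](x)  # modality-specific expert

      block = MultiwayBlock()
      img_tokens = block(torch.randn(1, 196, 768), "vision")
      txt_tokens = block(torch.randn(1, 32, 768), "language")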

Large Language models

  • GPT-3: Language Models are Few-Shot Learners

    GPT-3 is trained to predict the next word in a sequence. However, model developers and early users demonstrated that it has surprising capabilities, such as writing convincing essays, creating charts and websites from text descriptions, generating computer code, and more, all with little to no supervision. paper
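
    The training objective reduces to next-token prediction; the toy stand-in model below makes the loss concrete (the vocabulary size matches the GPT-2/3 BPE tokenizer, everything else is illustrative):

      import torch
      import torch.nn.functional as F

      VOCAB = 50257                               # GPT-2/3 BPE vocabulary size
      tokens = torch.randint(0, VOCAB, (1, 128))  # a batch of token ids
      model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64),
                                  torch.nn.Linear(64, VOCAB))  # stand-in for the transformer
      logits = model(tokens)                      # (1, 128, VOCAB) per-position logits
      # Each position is trained to predict the token that follows it.
      loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                             tokens[:, 1:].reshape(-1))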
