22 Oct 14:15

89e4d62

v0.31.0 Latest

v0.31.0: Stable Diffusion 3.5 Large, CogView3, Quantization, Training Scripts, and more

Stable Diffusion 3.5 Large

Stability AI’s latest text-to-image generation model is Stable Diffusion 3.5 Large. SD3.5 Large is the next iteration of Stable Diffusion 3. It comes with two checkpoints (both of which have 8B params):

A regular one
A timestep-distilled one enabling few-step inference

Make sure to fill out the access form on the model page, and then run huggingface-cli login before running the code below.

# make sure to update diffusers
# pip install -U diffusers
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=40,
    height=1024,
    width=1024,
    guidance_scale=4.5,
).images[0]

image.save("sd3_hello_world.png")

Follow the documentation to know more.
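The timestep-distilled checkpoint mentioned above is meant to be sampled with far fewer steps than the regular one. The sketch below keeps per-checkpoint sampling settings in one place. The settings for the regular checkpoint mirror the example above; the distilled repo id ("stabilityai/stable-diffusion-3.5-large-turbo") and its settings (4 steps, guidance disabled) are assumptions based on the public model card rather than on these notes, and the helper names are hypothetical.

```python
# Sampling settings per checkpoint. The "large" values mirror the example above;
# the "large-turbo" repo id and its values are assumptions taken from the
# public model card, not from these release notes.
SD35_SETTINGS = {
    "stabilityai/stable-diffusion-3.5-large": {
        "num_inference_steps": 40,
        "guidance_scale": 4.5,
    },
    "stabilityai/stable-diffusion-3.5-large-turbo": {
        "num_inference_steps": 4,
        "guidance_scale": 0.0,
    },
}


def sampling_kwargs(checkpoint: str) -> dict:
    """Return a copy of the sampling settings recorded for a checkpoint."""
    try:
        return dict(SD35_SETTINGS[checkpoint])
    except KeyError:
        raise ValueError(f"No settings recorded for {checkpoint!r}") from None


def generate(checkpoint: str, prompt: str):
    """Load a checkpoint and generate one image (needs a GPU and gated-model access)."""
    # Imported lazily so the settings helper above works without torch installed.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        checkpoint, torch_dtype=torch.bfloat16
    ).to("cuda")
    return pipe(prompt=prompt, **sampling_kwargs(checkpoint)).images[0]
```

Calling generate("stabilityai/stable-diffusion-3.5-large-turbo", "a photo of a cat") would then run the few-step path, provided the gate for that checkpoint has been accepted.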

CogView3-Plus

We added a new text-to-image model, CogView3-Plus, from the THUDM team! The model is DiT-based and supports image generation at resolutions from 512 to 2048px. Thanks to @zRzRzRzRzRzRzR for contributing it!

from diffusers import CogView3PlusPipeline
import torch

pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.float16).to("cuda")

# Enable it to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview3.png")

Refer to the documentation to know more.

Quantization

We have landed native quantization support in Diffusers, starting with bitsandbytes as its first quantization backend. With this, we hope to see large diffusion models becoming much more accessible to run on consumer hardware.
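To get a feel for why 4-bit quantization matters on consumer hardware, here is a rough back-of-the-envelope estimate of weight storage. It is only a sketch: the 12B parameter count is an illustrative assumption (roughly the size of the Flux transformer), the helper name is hypothetical, and the estimate ignores quantization constants, activations, and the other pipeline components.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed parameter count, for illustration only.
num_params = 12e9

bf16_gib = weight_memory_gib(num_params, 16)  # 16 bits per weight
nf4_gib = weight_memory_gib(num_params, 4)    # 4 bits per weight

print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the bfloat16 footprint; real-world savings will be somewhat smaller because of the extra quantization metadata.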

The example below shows how to run Flux.1 Dev with the NF4 data-type. Make sure you install the libraries:

pip install -Uq git+https://github.com/huggingface/transformers@main
pip install -Uq bitsandbytes
pip install -Uq diffusers

from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
import torch

ckpt_id = "black-forest-labs/FLUX.1-dev"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

Then, we use model_nf4 to instantiate the FluxPipeline:

from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipeline(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")

Follow the documentation here to know more. Additionally, check out this Colab Notebook that runs Flux.1 Dev in an end-to-end manner with NF4 quantization.

Training scripts

We have a fresh bucket of training scripts with this release:

Video model fine-tuning can be quite expensive. So, we have worked on a repository, cogvideox-factory, which provides memory-optimized scripts to fine-tune the Cog family of models.

Misc

We now support the loading of different kinds of Flux LoRAs, including Kohya, TheLastBen, and Xlabs.
Loading of Xlabs Flux ControlNets is also now supported. Thanks to @Anghellia for contributing it!

All commits

Feature flux controlnet img2img and inpaint pipeline by @ighoshsubho in #9408
Remove CogVideoX mentions from single file docs; Test updates by @a-r-r-o-w in #9444
set max_shard_size to None for pipeline save_pretrained by @a-r-r-o-w in #9447
adapt masked im2im pipeline for SDXL by @noskill in #7790
[Flux] add lora integration tests. by @sayakpaul in #9353
[training] CogVideoX Lora by @a-r-r-o-w in #9302
Several fixes to Flux ControlNet pipelines by @vladmandic in #9472
[refactor] LoRA tests by @a-r-r-o-w in #9481
[CI] fix nightly model tests by @sayakpaul in #9483
[Cog] some minor fixes and nits by @sayakpaul in #9466
[Tests] Reduce the model size in the lumina test by @saqlain2204 in #8985
Fix the bug of sd3 controlnet training when using gradient checkpointing. by @pibbo88 in #9498
[Schedulers] Add exponential sigmas / exponential noise schedule by @hlky in #9499
Allow DDPMPipeline half precision by @sbinnee in #9222
Add Noise Schedule/Schedule Type to Schedulers Overview documentation by @hlky in #9504
fix bugs for sd3 controlnet training by @xduzhangjiayu in #9489
[Doc] Fix path and and also import imageio by @LukeLIN-web in #9506
[CI] allow faster downloads from the Hub in CI. by @sayakpaul in #9478
a few fix for SingleFile tests by @yiyixuxu in #9522
Add exponential sigmas to other schedulers and update docs by @hlky in #9518
[Community Pipeline] Batched implementation of Flux with CFG by @sayakpaul in #9513
Update community_projects.md by @lee101 in #9266
[docs] Model sharding by @stevhliu in #9521
update get_parameter_dtype by @yiyixuxu in #9526
[Doc] Improved level of clarity for latents_to_rgb. by @LagPixelLOL in #9529
[Schedulers] Add beta sigmas / beta noise schedule by @hlky in #9509
flux controlnet fix (control_modes batch & others) by @yiyixuxu in #9507
[Tests] Fix ChatGLMTokenizer by @asomoza in #9536
[bug] Precedence of operations in VAE should be slicing -> tiling by @a-r-r-o-w in #9342
[LoRA] make set_adapters() method more robust. by @sayakpaul in #9535
[examples] add train flux-controlnet scripts in example. by @PromeAIpro in #9324
[Tests] [LoRA] clean up the serialization stuff. by @sayakpaul in #9512
[Core] fix variant-identification. by @sayakpaul in #9253
[refactor] remove conv_cache from CogVideoX VAE by @a-r-r-o-w in #9524
[train_instruct_pix2pix.py]Fix the LR schedulers when num_train_epochs is passed in a distributed training env by @AnandK27 in #9316
[chore] fix: retain memory utility. by @sayakpaul in #9543
[LoRA] support Kohya Flux LoRAs that have text encoders as well by @sayakpaul in #9542
Add beta sigmas to other schedulers and update docs by @hlky in #9538
Add PAG support to StableDiffusionControlNetPAGInpaintPipeline by @juancopi81 in #8875
Support bfloat16 for Upsample2D by @darhsu in #9480
fix cogvideox autoencoder decode by @Xiang-cd in #9569
[sd3] make sure height and size are divisible by 16 by @yiyixuxu in #9573
fix xlabs FLUX lora conversion typo by @Clement-Lelievre in #9581
[Chore] add a note on the versions in Flux LoRA integration tests by @sayakpaul in #9598
fix vae dtype when accelerate config using --mixed_precision="fp16" by @xduzhangjiayu in #9601
refac: docstrings in import_utils.py by @yijun-lee in #9583
Fix for use_safetensors parameters, allow use of parameter on loading submodels by @elismasilva in #9576
Update distributed_inference.md to include transformer.device_map by @sayakpaul in #9553
fix: CogVideox train dataset _preprocess_data crop video by @glide-the in #9574
[LoRA] Handle DoRA better by @sayakpau...

Contributors

noskill, pureexe, and 47 other contributors

17 Sep 06:22

a-r-r-o-w

v0.30.3

c9ff360

v0.30.3: CogVideoX Image-to-Video and Video-to-Video

This patch release adds Diffusers support for the upcoming CogVideoX-5B-I2V release (an Image-to-Video generation model)! The model weights will be available by the end of the week on the HF Hub at THUDM/CogVideoX-5b-I2V. Stay tuned for the release!

This release features two new pipelines:

CogVideoXImageToVideoPipeline
CogVideoXVideoToVideoPipeline

Additionally, we now have support for tiled encoding in the CogVideoX VAE. This can be enabled by calling the vae.enable_tiling() method, and it is used in the new Video-to-Video pipeline to encode sample videos to latents in a memory-efficient manner.

CogVideoXImageToVideoPipeline

The code below demonstrates how to use the new image-to-video pipeline:

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Optionally, enable memory optimizations.
# If enabling CPU offloading, remember to remove `pipe.to("cuda")` above
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
video = pipe(image, prompt, use_dynamic_cfg=True)
export_to_video(video.frames[0], "output.mp4", fps=8)

CogVideoXImageToVideoExample.mp4

CogVideoXVideoToVideoPipeline

The code below demonstrates how to use the new video-to-video pipeline:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

input_video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
)
prompt = (
    "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
    "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
    "the scene. The sky looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
    "moons, but the remainder of the scene is mostly realistic."
)

video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)

CogVideoXVideoToVideoExample.mp4

Shoutout to @tin2tin for the awesome demonstration!

Refer to our documentation to learn more about it.

All commits

[core] Support VideoToVideo with CogVideoX by @a-r-r-o-w in #9333
[core] CogVideoX memory optimizations in VAE encode by @a-r-r-o-w in #9340
[CI] Quick fix for Cog Video Test by @DN6 in #9373
[refactor] move positional embeddings to patch embed layer for CogVideoX by @a-r-r-o-w in #9263
CogVideoX-5b-I2V support by @zRzRzRzRzRzRzR in #9418

Contributors

tin2tin, DN6, and 2 other contributors

31 Aug 00:23

asomoza

v0.30.2

f63c126

v0.30.2: Update from single file default repository

All commits

update runway repo for single_file by @yiyixuxu in #9323
Fix Flux CLIP prompt embeds repeat for num_images_per_prompt > 1 by @DN6 in #9280
[IP Adapter] Fix cache_dir and local_files_only for image encoder by @asomoza in #9272

Contributors

asomoza, DN6, and yiyixuxu

24 Aug 07:26

yiyixuxu

v0.30.1

8b9bfae

v0.30.1: CogVideoX-5B & Bug fixes

CogVideoX-5B

This patch release adds Diffusers support for the upcoming CogVideoX-5B release! The model weights will be available next week on the Hugging Face Hub at THUDM/CogVideoX-5b. Stay tuned for the release!

Additionally, we have implemented a VAE tiling feature, which reduces the memory requirement for CogVideoX models. With this update, the total memory requirement is now 12GB for CogVideoX-2B and 21GB for CogVideoX-5B (with CPU offloading). To enable this feature, simply call enable_tiling() on the VAE.
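For intuition, tiling works by processing a large frame as a grid of smaller, overlapping tiles so that only one tile's activations need to be held in memory at a time. The sketch below shows only the partitioning step. It is not the Diffusers implementation: the helper names, the 256-pixel tile size, the 32-pixel overlap, and the 480x720 frame size are all illustrative assumptions, and a real implementation also has to blend the overlapping regions.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of overlapping tiles that together cover [0, length)."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # the final tile is aligned to the end
    return starts


def tile_boxes(height: int, width: int, tile: int = 256, overlap: int = 32):
    """(top, left, bottom, right) boxes for every tile of a height x width frame."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


# Assumed frame size, for illustration only.
boxes = tile_boxes(480, 720)
largest = max((b - t) * (r - l) for t, l, b, r in boxes)
print(f"{len(boxes)} tiles, largest tile has {largest} pixels vs {480 * 720} for the full frame")
```

Peak memory then scales with the tile size rather than with the full frame, at the cost of some redundant computation in the overlapping regions.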

The code below shows how to generate a video with CogVideoX-5B:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "Tracking shot,late afternoon light casting long shadows,a cyclist in athletic gear pedaling down a scenic mountain road,winding path with trees and a lake in the background,invigorating and adventurous atmosphere."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]

export_to_video(video, "output.mp4", fps=8)

000000.mp4

Refer to our documentation to learn more about it.

All commits

Update Video Loading/Export to use imageio by @DN6 in #9094
[refactor] CogVideoX followups + tiled decoding support by @a-r-r-o-w in #9150
Add Learned PE selection for Auraflow by @cloneofsimo in #9182
[Single File] Fix configuring scheduler via legacy kwargs by @DN6 in #9229
[Flux LoRA] support parsing alpha from a flux lora state dict. by @sayakpaul in #9236
[tests] fix broken xformers tests by @a-r-r-o-w in #9206
Cogvideox-5B Model adapter change by @zRzRzRzRzRzRzR in #9203
[Single File] Support loading Comfy UI Flux checkpoints by @DN6 in #9243

Contributors

DN6, sayakpaul, and 3 other contributors

07 Aug 07:47

sayakpaul

v0.30.0

8a79d8e

v0.30.0: New Pipelines (Flux, Stable Audio, Kolors, CogVideoX, Latte, and more), New Methods (FreeNoise, SparseCtrl), and New Refactors

New pipelines

Image taken from Lumina’s GitHub repository.

This release features many new pipelines. Below, we provide a list:

Audio pipelines 🎼

Stable Audio

Video pipelines 📹

Latte (thanks to @maxin-cn for the contribution through #8404)
CogVideoX (thanks to @zRzRzRzRzRzRzR for the contribution through #9082)

Image pipelines 🎇

Lumina (thanks to @PommesPeter for the contribution through #8652)
Kolors
AuraFlow
Flux

Be sure to check out the respective docs to know more about these pipelines. Some additional pointers are below for curious minds:

Lumina introduces a new DiT architecture that is multilingual in nature.
Kolors is inspired by SDXL and is also multilingual in nature.
Flux introduces the largest (more than 12B parameters!) open-sourced DiT variant available to date. For efficient DreamBooth + LoRA training, we recommend @bghira’s guide here.
We have worked on a guide that shows how to quantize these large pipelines for memory efficiency with optimum.quanto. Check it out here.
CogVideoX introduces a novel and truly 3D VAE into Diffusers.

Perturbed Attention Guidance (PAG)

Without PAG	With PAG

We already had community pipelines for PAG, but given its usefulness, we decided to make it a first-class citizen of the library. We have a central usage guide for PAG here, which should be the entry point for a user interested in understanding and using PAG for their use cases. We currently support the following pipelines with PAG:

StableDiffusionPAGPipeline
StableDiffusion3PAGPipeline
StableDiffusionControlNetPAGPipeline
StableDiffusionXLPAGPipeline
StableDiffusionXLPAGImg2ImgPipeline
StableDiffusionXLPAGInpaintPipeline
StableDiffusionXLControlNetPAGPipeline
PixArtSigmaPAGPipeline
HunyuanDiTPAGPipeline
AnimateDiffPAGPipeline
KolorsPAGPipeline

If you’re interested in helping us extend our PAG support for other pipelines, please check out this thread.
Special thanks to Ahn Donghoon (@sunovivid), the author of PAG, for helping us with the integration and adding PAG support to SD3.

AnimateDiff with SparseCtrl

SparseCtrl introduces methods of controllability into text-to-video diffusion models leveraging signals such as line/edge sketches, depth maps, and RGB images by incorporating an additional condition encoder, inspired by ControlNet, to process these signals in the AnimateDiff framework. It can be applied to a diverse set of applications such as interpolation or video prediction (filling in the gaps between sequence of images for animation), personalized image animation, sketch-to-video, depth-to-video, and more. It was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models.

The authors have made two SparseCtrl-specific checkpoints and a Motion LoRA available.

Scribble Interpolation Example:

import torch

from diffusers import AnimateDiffSparseControlNetPipeline, AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image

device = "cuda"

motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to(device)
# Swap in a DPM-Solver++ scheduler built from the pipeline's own scheduler config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, beta_schedule="linear", algorithm_type="dpmsolver++", use_karras_sigmas=True)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-v1-5-3", adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)

prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"

image_files = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")

📜 Check out the docs here.
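To make the "sparse" part concrete: in the example above, three scribbles condition frames 0, 8, and 15, and every other frame is left unconditioned. The sketch below lays the conditioning frames out on a timeline and derives a per-frame mask. It is a plain-Python illustration, not the pipeline's internals: the helper name is hypothetical, and the 16-frame clip length is an assumption.

```python
from typing import Optional, Sequence


def place_conditioning_frames(
    frames: Sequence[str], frame_indices: Sequence[int], num_frames: int
) -> list[Optional[str]]:
    """Put each conditioning frame at its index; unconditioned slots stay None."""
    if len(frames) != len(frame_indices):
        raise ValueError("need exactly one index per conditioning frame")
    timeline: list[Optional[str]] = [None] * num_frames
    for frame, index in zip(frames, frame_indices):
        if not 0 <= index < num_frames:
            raise ValueError(f"frame index {index} is outside the clip")
        timeline[index] = frame
    return timeline


# Placeholder names stand in for the three scribble images; 16 frames is assumed.
timeline = place_conditioning_frames(
    ["scribble-1", "scribble-2", "scribble-3"], [0, 8, 15], num_frames=16
)
mask = [int(frame is not None) for frame in timeline]
print(mask)
```

Frames whose mask entry is 0 receive no control signal, which is what lets the model interpolate freely between the conditioned keyframes.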

FreeNoise for AnimateDiff

FreeNoise is a training-free method that allows extending the generative capabilities of pretrained video diffusion models beyond their existing context/frame limits.

Instead of initializing noises for all frames, FreeNoise reschedules a sequence of noises for long-range correlation and performs temporal attention over them using a window-based function. We have added FreeNoise to the AnimateDiff family of models in Diffusers, allowing them to generate videos beyond their default 32-frame limit.

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerAncestralDiscreteScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = EulerAncestralDiscreteScheduler(
    beta_schedule="linear",
    beta_start=0.00085,
    beta_end=0.012,
)

pipe.enable_free_noise()
pipe.vae.enable_slicing()

pipe.enable_model_cpu_offload()
frames = pipe(
    "An astronaut riding a horse on Mars.",
    num_frames=64,
    num_inference_steps=20,
    guidance_scale=7.0,
    decode_chunk_size=2,
).frames[0]

export_to_gif(frames, "freenoise-64.gif")
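The two ideas behind FreeNoise described above, overlapping temporal-attention windows and rescheduled noise, can be sketched without any model at all. The code below is a simplified illustration rather than the Diffusers implementation: the function names are hypothetical, the window length of 16 and stride of 4 are assumed values, and the noise rescheduling is reduced to reusing shuffled copies of the first window's noise indices.

```python
import random


def sliding_windows(
    num_frames: int, context_length: int = 16, context_stride: int = 4
) -> list[tuple[int, int]]:
    """(start, end) frame ranges of overlapping temporal-attention windows."""
    if num_frames <= context_length:
        return [(0, num_frames)]
    windows = [
        (start, start + context_length)
        for start in range(0, num_frames - context_length + 1, context_stride)
    ]
    if windows[-1][1] < num_frames:
        # Add a final window aligned to the end so every frame is covered.
        windows.append((num_frames - context_length, num_frames))
    return windows


def rescheduled_noise_indices(
    num_frames: int, context_length: int = 16, seed: int = 0
) -> list[int]:
    """For each frame, the index of the first-window noise sample it reuses."""
    rng = random.Random(seed)
    indices = list(range(min(num_frames, context_length)))
    while len(indices) < num_frames:
        block = list(range(context_length))
        rng.shuffle(block)  # later frames reuse the same noise in shuffled order
        indices.extend(block)
    return indices[:num_frames]


windows = sliding_windows(64)
noise_indices = rescheduled_noise_indices(64)
print(len(windows), windows[0], windows[-1])
```

Because every frame's noise is drawn from the same small pool and neighbouring windows overlap, distant frames stay correlated even though attention only ever looks at one window at a time.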

LoRA refactor

We have significantly refactored the loader classes associated with LoRA. Going forward, this will help in adding LoRA support for new pipelines and models. We now have a LoraBaseMixin class which is subclassed by the different pipeline-level LoRA loading classes such as StableDiffusionXLLoraLoaderMixin. This document provides an overview of the available classes.

Additionally, we have increased the coverage of methods within the PeftAdapterMixin class. This refactoring allows all the supported models to share common LoRA functionalities such as set_adapter(), add_adapter(), and so on.

To learn more details, please follow this PR. If you see any LoRA-related iss...

Contributors

catwell, noskill, and 49 other contributors

27 Jun 03:59

sayakpaul

v0.29.2

c586aad

v0.29.2: fix deprecation and LoRA bugs 🐞

All commits

[SD3] Fix mis-matched shape when num_images_per_prompt > 1 using without T5 (text_encoder_3=None) by @Dalanke in #8558
[LoRA] refactor lora conversion utility. by @sayakpaul in #8295
[LoRA] fix conversion utility so that lora dora loads correctly by @sayakpaul in #8688
[Chore] remove deprecation from transformer2d regarding the output class. by @sayakpaul in #8698
[LoRA] fix vanilla fine-tuned lora loading. by @sayakpaul in #8691
Release: v0.29.2 by @sayakpaul (direct commit on v0.29.2-patch)

Contributors

sayakpaul and Dalanke

21 Jun 01:50

yiyixuxu

v0.29.1

a0a5427

v0.29.1: SD3 ControlNet, Expanded SD3 `from_single_file` support, Using long Prompts with T5 Text Encoder & Bug fixes

SD3 ControlNet

import torch
from diffusers import StableDiffusion3ControlNetPipeline
from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel
from diffusers.utils import load_image

controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16)

pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.to("cuda")
control_image = load_image("https://huggingface.co/InstantX/SD3-Controlnet-Canny/resolve/main/canny.jpg")
prompt = "A girl holding a sign that says InstantX"
image = pipe(prompt, control_image=control_image, controlnet_conditioning_scale=0.7).images[0]
image.save("sd3.png")

📜 Refer to the official docs here to learn more about it.

Thanks to @haofanwang @wangqixun from the @ResearcherXman team for contributing this pipeline!

Expanded single file support

We now support all available single-file checkpoints for SD3 in Diffusers! To load the single-file checkpoint that includes the T5 text encoder:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file-t5-fp8.png')

Using Long Prompts with the T5 Text Encoder

We increased the default sequence length for the T5 Text Encoder from a maximum of 77 to 256! It can be adjusted to accept fewer or more tokens by setting the max_sequence_length to a maximum of 512. Keep in mind that longer sequences require additional resources and will result in longer generation times. This effect is particularly noticeable during batch inference.

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. This imaginative creature features the distinctive, bulky body of a hippo, but with a texture and appearance resembling a golden-brown, crispy waffle. The creature might have elements like waffle squares across its skin and a syrup-like sheen. It’s set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. The image should evoke a sense of playful absurdity and culinary fantasy."

image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

Before	max_sequence_length=256	max_sequence_length=512

All commits

Release: v0.29.0 by @sayakpaul (direct commit on v0.29.1-patch)
prepare for patch release by @yiyixuxu (direct commit on v0.29.1-patch)
fix warning log for Transformer SD3 by @sayakpaul in #8496
Add SD3 AutoPipeline mappings by @Beinsezii in #8489
Add Hunyuan AutoPipe mapping by @Beinsezii in #8505
Expand Single File support in SD3 Pipeline by @DN6 in #8517
[Single File Loading] Handle unexpected keys in CLIP models when accelerate isn't installed. by @DN6 in #8462
Fix sharding when no device_map is passed by @SunMarc in #8531
[SD3 Inference] T5 Token limit by @asomoza in #8506
Fix gradient checkpointing issue for Stable Diffusion 3 by @Carolinabanana in #8542
Support SD3 ControlNet and Multi-ControlNet. by @wangqixun in #8566
fix from_single_file for checkpoints with t5 by @yiyixuxu in #8631
[SD3] Fix mis-matched shape when num_images_per_prompt > 1 using without T5 (text_encoder_3=None) by @Dalanke in #8558

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@wangqixun
- Support SD3 ControlNet and Multi-ControlNet. (#8566)

Contributors

@RefractAI

asomoza, DN6, and 9 other contributors

12 Jun 20:14

sayakpaul

v0.29.0

39aa390

v0.29.0: Stable Diffusion 3

This release emphasizes Stable Diffusion 3, Stability AI’s latest iteration of the Stable Diffusion family of models. It was introduced in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.

As the model is gated, before using it with diffusers, you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate.

huggingface-cli login

The code below shows how to perform text-to-image generation with SD3:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image

Refer to our documentation for learning all the optimizations you can apply to SD3 as well as the image-to-image pipeline.

Additionally, we support DreamBooth + LoRA fine-tuning of Stable Diffusion 3 through rectified flow. Check out this directory for more details.

04 Jun 21:36

yiyixuxu

v0.28.2

de9528e

v0.28.2: fix `from_single_file` clip model checkpoint key error 🐞

Change checkpoint key used to identify CLIP models in single file checkpoints by @DN6 in #8319

Contributors

DN6

04 Jun 10:37

sayakpaul

v0.28.1

0091f08

v0.28.1: HunyuanDiT and Transformer2D model class variants

This patch release primarily introduces the Hunyuan DiT pipeline from the Tencent team.

Hunyuan DiT

Hunyuan DiT is a transformer-based diffusion pipeline, introduced in the paper Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding by the Tencent Hunyuan team.

import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")

# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]

🧠 This pipeline supports multilingual prompts.

📜 Refer to the official docs here to learn more about it.

Thanks to @gnobitab, for contributing Hunyuan DiT in #8240.

All commits

Release: v0.28.0 by @sayakpaul (direct commit on v0.28.1-patch)
[Core] Introduce class variants for Transformer2DModel by @sayakpaul in #7647
resolve comflicts by @toshas (direct commit on v0.28.1-patch)
Tencent Hunyuan Team: add HunyuanDiT related updates by @gnobitab in #8240
Tencent Hunyuan Team - Updated Doc for HunyuanDiT by @gnobitab in #8383
[Transformer2DModel] Handle norm_type safely while remapping by @sayakpaul in #8370
Release: v0.28.1 by @sayakpaul (direct commit on v0.28.1-patch)

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@gnobitab
- Tencent Hunyuan Team: add HunyuanDiT related updates (#8240)
- Tencent Hunyuan Team - Updated Doc for HunyuanDiT (#8383)

Contributors

gnobitab, toshas, and sayakpaul
