Releases: chengzeyi/ParaAttention
v0.3.3
Nightly Release 20241126
TODO: Add nightly release notes
v0.3.2
🚀 Support multi-GPU parallel inference speedup for CogVideoX
Everything works out of the box!
import torch
import torch.distributed as dist
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
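# With no arguments, init_process_group() reads the rank and world size from
# the environment, so launch this script with torchrun (one process per GPU).
dist.init_process_group()
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to(f"cuda:{dist.get_rank()}")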
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
parallelize_pipe(
    pipe,
    mesh=init_context_parallel_mesh(
        pipe.device.type,
        max_batch_dim_size=2,
        max_ring_dim_size=2,
    ),
)
torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    # generator=torch.Generator(device=pipe.device).manual_seed(42),
).frames[0]
if dist.get_rank() == 0:
    print("Saving video to cogvideox.mp4")
    export_to_video(video, "cogvideox.mp4", fps=8)
dist.destroy_process_group()
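The `max_batch_dim_size=2` and `max_ring_dim_size=2` arguments in the example cap how many processes may be assigned to the batch and ring dimensions of the mesh. As a rough mental model only, the sketch below shows one way a world size could be factored under such caps, with any remaining processes assigned to a Ulysses dimension. The helper `sketch_mesh_shape` and its gcd-based rule are assumptions made for illustration, not `para_attn`'s actual implementation; consult the library source for the real behavior.

```python
import math


def sketch_mesh_shape(world_size, max_batch_dim_size=1, max_ring_dim_size=1):
    """Hypothetical factoring of world_size into (batch, ring, ulysses) sizes.

    Assumption: each capped dimension takes the largest size that divides the
    remaining process count and does not exceed its cap (via gcd), and the
    Ulysses dimension absorbs whatever is left.
    """
    if world_size < 1:
        raise ValueError("world_size must be a positive integer")
    batch = math.gcd(world_size, max_batch_dim_size)
    remaining = world_size // batch
    ring = math.gcd(remaining, max_ring_dim_size)
    ulysses = remaining // ring
    return batch, ring, ulysses


if __name__ == "__main__":
    # Print the hypothetical mesh shape for a few process counts.
    for n in (1, 2, 4, 8):
        shape = sketch_mesh_shape(n, max_batch_dim_size=2, max_ring_dim_size=2)
        print(f"world_size={n}: (batch, ring, ulysses)={shape}")
```

Under this assumed rule the three sizes always multiply back to the world size, which makes it a convenient sanity check when choosing how many processes to launch.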
v0.3.1
Fix causal Ulysses attention.
v0.3.0
Merge pull request #4 from chengzeyi/dev_batch_parallel Dev batch parallel
v0.2.0
Provides an easy-to-use interface for speeding up model inference with context parallelism and torch.compile. FLUX and Mochi inference becomes much faster with no loss in output quality.
Performance
Model | GPU | Method | Wall Time (s) | Speedup |
---|---|---|---|---|
FLUX.1-dev | A100-SXM4-80GB | Baseline | 13.843 | 1.00x |
FLUX.1-dev | A100-SXM4-80GB | torch.compile | 9.997 | 1.38x |
FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) | 8.379 | 1.65x |
FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) | 8.307 | 1.66x |
FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 5.915 | 2.34x |
FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 5.775 | 2.39x |
FLUX.1-dev | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
mochi-1-preview | A100-SXM4-80GB | Baseline | 196.534 | 1.00x |
mochi-1-preview | A100-SXM4-80GB | torch.compile | 149.868 | 1.31x |
mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) | 110.146 | 1.78x |
mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) | 109.435 | 1.80x |
mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 83.912 | 2.34x |
mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 82.176 | 2.39x |
mochi-1-preview | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
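The Speedup column is the single-GPU baseline wall time divided by the wall time of each configuration. A minimal sketch of that arithmetic for the FLUX.1-dev rows follows; the helper name `speedup` and the dictionary layout are illustrative choices rather than part of the project, and the wall times are copied from the table.

```python
def speedup(baseline_s, measured_s):
    """Return how many times faster a run is than the baseline run."""
    if baseline_s <= 0 or measured_s <= 0:
        raise ValueError("wall times must be positive")
    return baseline_s / measured_s


# FLUX.1-dev wall times in seconds, taken from the table above.
FLUX_BASELINE_S = 13.843
FLUX_WALL_TIMES_S = {
    "torch.compile": 9.997,
    "para-attn (ulysses)": 8.379,
    "para-attn (ring)": 8.307,
    "para-attn (ulysses) + torch.compile": 5.915,
    "para-attn (ring) + torch.compile": 5.775,
}

if __name__ == "__main__":
    for method, seconds in FLUX_WALL_TIMES_S.items():
        ratio = speedup(FLUX_BASELINE_S, seconds)
        print(f"{method}: {ratio:.3f}x")
```

The same function applies to the mochi-1-preview rows with 196.534 s as the baseline.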
v0.1.0
update python-publish.yml