
Releases: chengzeyi/ParaAttention

v0.3.3

26 Nov 09:12
Set `output_type` to `latent` for all ranks except rank 0

Nightly Release 20241126

26 Nov 11:18
Pre-release

TODO: Add nightly release notes

v0.3.2

19 Nov 15:42
6668e18

🚀 Multi-GPU Parallel Inference Speedup for CogVideoX

Everything works out of the box!

```python
import torch
import torch.distributed as dist
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

dist.init_process_group()

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to(f"cuda:{dist.get_rank()}")

# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe

parallelize_pipe(
    pipe,
    mesh=init_context_parallel_mesh(
        pipe.device.type,
        max_batch_dim_size=2,
        max_ring_dim_size=2,
    ),
)

torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    # generator=torch.Generator(device=pipe.device).manual_seed(42),
).frames[0]

if dist.get_rank() == 0:
    print("Saving video to cogvideox.mp4")
    export_to_video(video, "cogvideox.mp4", fps=8)

dist.destroy_process_group()
```
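The script calls `dist.init_process_group()`, so it expects one process per GPU. Assuming you save it as `run_cogvideox.py` (a hypothetical filename), a typical two-GPU launch with `torchrun` might look like:

```shell
# torchrun spawns one process per GPU and sets the environment variables
# (RANK, WORLD_SIZE, MASTER_ADDR, ...) that dist.init_process_group() reads.
torchrun --nproc_per_node=2 run_cogvideox.py
```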

v0.3.1

16 Nov 03:43
Fix causal Ulysses attention

v0.3.0

14 Nov 11:08
ffb4f05
Merge pull request #4 from chengzeyi/dev_batch_parallel

Dev batch parallel

v0.2.0

13 Nov 11:26
17a3088

Provides an easy-to-use interface to speed up model inference with context parallelism and torch.compile, making FLUX and Mochi inference much faster, losslessly.

Performance

| Model | GPU | Method | Wall Time (s) | Speedup |
| --- | --- | --- | --- | --- |
| FLUX.1-dev | A100-SXM4-80GB | Baseline | 13.843 | 1.00x |
| FLUX.1-dev | A100-SXM4-80GB | torch.compile | 9.997 | 1.38x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) | 8.379 | 1.65x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) | 8.307 | 1.66x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 5.915 | 2.34x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 5.775 | 2.39x |
| FLUX.1-dev | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
| mochi-1-preview | A100-SXM4-80GB | Baseline | 196.534 | 1.00x |
| mochi-1-preview | A100-SXM4-80GB | torch.compile | 149.868 | 1.31x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) | 110.146 | 1.78x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) | 109.435 | 1.80x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 83.912 | 2.34x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 82.176 | 2.39x |
| mochi-1-preview | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
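The speedup column is simply the baseline wall time divided by each configuration's wall time (a couple of rows appear to truncate rather than round, so the last digit can differ by one). A quick sanity check of the FLUX.1-dev rows:

```python
# Wall times in seconds for the FLUX.1-dev rows of the table above.
baseline = 13.843
wall_times = {
    "torch.compile": 9.997,
    "2x para-attn (ulysses)": 8.379,
    "2x para-attn (ring)": 8.307,
    "2x para-attn (ulysses) + torch.compile": 5.915,
    "2x para-attn (ring) + torch.compile": 5.775,
}

# Speedup = baseline wall time / configuration wall time.
for method, t in wall_times.items():
    print(f"{method}: {baseline / t:.2f}x")
```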

v0.1.0

06 Nov 06:09
Update python-publish.yml