
Releases: chengzeyi/ParaAttention

v0.3.3

26 Nov 09:12
Set `output_type` to `latent` for all ranks except rank 0

Nightly Release 20241126

26 Nov 11:18
Pre-release

TODO: Add nightly release notes

v0.3.2

19 Nov 15:42
6668e18

🚀 Multi-GPU Parallel Inference Speedup for CogVideoX

Everything works out of the box!

```python
import torch
import torch.distributed as dist
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

dist.init_process_group()

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to(f"cuda:{dist.get_rank()}")

# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe

parallelize_pipe(
    pipe,
    mesh=init_context_parallel_mesh(
        pipe.device.type,
        max_batch_dim_size=2,
        max_ring_dim_size=2,
    ),
)

torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    # generator=torch.Generator(device=pipe.device).manual_seed(42),
).frames[0]

if dist.get_rank() == 0:
    print("Saving video to cogvideox.mp4")
    export_to_video(video, "cogvideox.mp4", fps=8)

dist.destroy_process_group()
```
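The script calls `dist.init_process_group()`, so it expects one process per GPU. Assuming you save it as `run_cogvideox.py` (a hypothetical filename), a typical two-GPU launch with `torchrun` might look like:

```shell
# torchrun spawns one process per GPU and sets the environment variables
# (RANK, WORLD_SIZE, MASTER_ADDR, ...) that dist.init_process_group() reads.
torchrun --nproc_per_node=2 run_cogvideox.py
```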

v0.3.1

16 Nov 03:43
Fix causal Ulysses attention

v0.3.0

14 Nov 11:08
ffb4f05
Merge pull request #4 from chengzeyi/dev_batch_parallel

Dev batch parallel

v0.2.0

13 Nov 11:26
17a3088

Provides an easy-to-use interface to speed up model inference with context parallelism and torch.compile, making FLUX and Mochi inference much faster, losslessly.

Performance

| Model | GPU | Method | Wall Time (s) | Speedup |
| --- | --- | --- | --- | --- |
| FLUX.1-dev | A100-SXM4-80GB | Baseline | 13.843 | 1.00x |
| FLUX.1-dev | A100-SXM4-80GB | torch.compile | 9.997 | 1.38x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) | 8.379 | 1.65x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) | 8.307 | 1.66x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 5.915 | 2.34x |
| FLUX.1-dev | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 5.775 | 2.39x |
| FLUX.1-dev | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
| mochi-1-preview | A100-SXM4-80GB | Baseline | 196.534 | 1.00x |
| mochi-1-preview | A100-SXM4-80GB | torch.compile | 149.868 | 1.31x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) | 110.146 | 1.78x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) | 109.435 | 1.80x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ulysses) + torch.compile | 83.912 | 2.34x |
| mochi-1-preview | A100-SXM4-80GB x 2 | para-attn (ring) + torch.compile | 82.176 | 2.39x |
| mochi-1-preview | A100-SXM4-80GB x 4 | para-attn (ulysses + ring) + torch.compile | ? | ? |
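The speedup column is simply the baseline wall time divided by each configuration's wall time (a couple of rows appear to truncate rather than round, so the last digit can differ by one). A quick sanity check of the FLUX.1-dev rows:

```python
# Wall times in seconds for the FLUX.1-dev rows of the table above.
baseline = 13.843
wall_times = {
    "torch.compile": 9.997,
    "2x para-attn (ulysses)": 8.379,
    "2x para-attn (ring)": 8.307,
    "2x para-attn (ulysses) + torch.compile": 5.915,
    "2x para-attn (ring) + torch.compile": 5.775,
}

# Speedup = baseline wall time / configuration wall time.
for method, t in wall_times.items():
    print(f"{method}: {baseline / t:.2f}x")
```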

v0.1.0

06 Nov 06:09
Update python-publish.yml