sub-quadratic attention #1
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Nice!
…hannels_per_head] in order to make use of batched matmuls. fuse multiply into matmul. breaks bias, mask in exchange for massive speedup.
…ghts_calc_fn, calc_fn_data) and unused vars
…ul for SD 2.1. but remove value float32, having established that it works without.
…to prefer fast-path whenever unchunked attention would fit into memory. add kv_chunk_size_min to control the kv_chunk_size=None behaviour, so that sqrt(key_tokens) does not pick too small of a chunk size
…of chunk key size. improve separation of concerns.
…al kv_chunk_size: they can notice when no chunking would happen at all, and use fast-path. note: there's a question of whether that concern belongs *inside* the algorithm. but it'd feel weird for chunked attention to have a no-chunking-at-all branch.
… equivalent fast-path for 1 query chunk, 1 kv chunk is already supported inside
…ything in one chunk, to re-use an existing fast-path.
Force-pushed from 84bf1c0 to 0eafb95
…ose during the matmul
Force-pushed from 3c92600 to 9dc6822
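The chunk-size selection the commit messages above describe might look roughly like this. A hedged sketch: kv_chunk_size_min and the sqrt(key_tokens) default come from the commits; the function and parameter names are hypothetical, not the PR's actual API.

    import math
    from typing import Optional

    def dynamic_kv_chunk_size(key_tokens: int, kv_chunk_size_min: Optional[int] = None) -> int:
        # default kv chunk size: sqrt(key_tokens), per the paper's O(sqrt(n)) memory scheme
        chunk = int(math.sqrt(key_tokens))
        # kv_chunk_size_min stops sqrt(key_tokens) from picking too small a chunk
        if kv_chunk_size_min is not None:
            chunk = max(chunk, kv_chunk_size_min)
        return min(chunk, key_tokens)

    def prefers_fast_path(q_tokens: int, key_tokens: int, q_chunk_size: int, kv_chunk_size: int) -> bool:
        # when everything fits into one query chunk and one kv chunk, no chunking
        # would happen at all, so dispatch to ordinary (unchunked) attention instead
        return q_tokens <= q_chunk_size and key_tokens <= kv_chunk_size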
    starts: List[int],
    sizes: List[int],
) -> Tensor:
    slicing = [slice(start, start + size) for start, size in zip(starts, sizes)]
this attempts to implement jax.lax.dynamic_slice(), but hey, is this literally just torch.narrow()?
Yeah that works also:
brkirch/stable-diffusion-webui@b119815
No notable performance difference that I observed, but it's probably slightly more efficient nonetheless.
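For the record, a small sketch of the equivalence (dynamic_slice here is a stand-in for the helper above; torch.narrow is the stock PyTorch call, and like slice indexing it returns a view rather than a copy):

    from typing import List
    import torch
    from torch import Tensor

    def dynamic_slice(x: Tensor, starts: List[int], sizes: List[int]) -> Tensor:
        # slice-list version, as in the helper above (tuple() for safe indexing)
        slicing = [slice(start, start + size) for start, size in zip(starts, sizes)]
        return x[tuple(slicing)]

    def dynamic_slice_narrow(x: Tensor, starts: List[int], sizes: List[int]) -> Tensor:
        # torch.narrow version: narrow each dim in turn; also returns a view
        for dim, (start, size) in enumerate(zip(starts, sizes)):
            x = x.narrow(dim, start, size)
        return x

    t = torch.arange(24).reshape(2, 3, 4)
    assert torch.equal(dynamic_slice(t, [0, 1, 2], [2, 2, 2]),
                       dynamic_slice_narrow(t, [0, 1, 2], [2, 2, 2]))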
    scale: float,
) -> AttnChunk:
    attn_weights = torch.baddbmm(
        torch.empty(1, 1, 1, device=query.device, dtype=query.dtype),
Shouldn't torch.zeros() be used here instead of torch.empty()?
Nope; it's actually an unused tensor (because beta=0), so we want whatever's the cheapest thing that passes the parameter validation. Unfortunately PyTorch complains if you pass None. Bad API design.
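A quick sketch of why the uninitialized memory is harmless here: torch.baddbmm computes beta * input + alpha * (batch1 @ batch2), and the docs guarantee input is ignored when beta=0 (NaN/inf included). alpha also fuses the attention scale into the matmul, per the commit message above. Shapes below are illustrative:

    import torch

    q = torch.randn(8, 128, 64)  # [batch * num_heads, q_tokens, channels_per_head]
    k = torch.randn(8, 128, 64)  # [batch * num_heads, k_tokens, channels_per_head]
    scale = 64 ** -0.5

    attn_weights = torch.baddbmm(
        torch.empty(1, 1, 1, device=q.device, dtype=q.dtype),  # never read: beta=0
        q,
        k.transpose(1, 2),
        beta=0,
        alpha=scale,  # fuses the 1/sqrt(d) multiply into the matmul
    )
    assert torch.allclose(attn_weights, (q @ k.transpose(1, 2)) * scale, atol=1e-5)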
Implementation of:
Self-attention Does Not Need O(n^2) Memory:
https://arxiv.org/abs/2112.05682v2
Based on Amin Rezaei's implementation:
https://github.com/AminRezaei0x443/memory-efficient-attention
With [batch * num_heads, tokens, channels_per_head] format
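A minimal sketch of getting tensors into that layout (hypothetical helper name; merging the batch and head dims is what lets every head's attention run as one plain batched matmul):

    import torch

    def to_batched_heads(x: torch.Tensor, heads: int) -> torch.Tensor:
        # [batch, tokens, heads * channels] -> [batch * num_heads, tokens, channels_per_head]
        b, t, c = x.shape
        return (x.reshape(b, t, heads, c // heads)
                 .permute(0, 2, 1, 3)
                 .reshape(b * heads, t, c // heads))

    x = torch.randn(2, 4096, 320)     # e.g. an SD UNet attention input
    q = to_batched_heads(x, heads=8)
    assert q.shape == (16, 4096, 40)  # [batch * num_heads, tokens, channels_per_head]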