RWKV-CUDA

The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )

Towards RWKV-4 (see the wkv folder)

I have a basic RWKV-4 kernel in the wkv folder. Let's optimize it.

The formula:

w.shape = (C, T)
k.shape = (B, C, T)
out.shape = (B, C, T)
out[b][c][t] = sum_u{ w[c][(T-1)-(t-u)] * k[b][c][u] }

pytorch = fwd 94ms bwd 529ms

CUDA kernel v0 = fwd 45ms bwd 84ms (simple)

CUDA kernel v1 = fwd 17ms bwd 43ms (shared memory)

CUDA kernel v2 = fwd 13ms bwd 31ms (float4)

CUDA kernel v3 = fwd 3.4ms bwd 23ms (B-group)

More test on RTX3090:

pytorch = fwd 14ms bwd 65ms

CUDA kernel v3 = fwd 0.8ms bwd 5.5ms

How to use: python run.py and it will compile everything for you (pip install Ninja if you don't have it).

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
depthwise_conv1d		depthwise_conv1d
wkv		wkv
wkv5		wkv5
wkv5_bf16		wkv5_bf16
wkv5a		wkv5a
wkv6		wkv6
wkv6_state		wkv6_state
.gitignore		.gitignore
README.md		README.md