
enhance fla support for RWKV6 #44

Closed
wants to merge 24 commits

Conversation

uniartisan
Contributor

This pull request aims to enhance FLA support for RWKV6, improving both speed and performance in bf16. It also enables FLA on Intel GPUs.

FLA ChunkRWKV6 Optimized Implementation

This repository contains an optimized implementation of ChunkRWKV6 using FLA (Flash Linear Attention) techniques. Our goal is to improve both accuracy and speed compared to the standard CUDA implementations.

Performance Comparison

We've conducted performance tests comparing our FLA BF16 implementation with the standard CUDA BF16 implementation. Here are some key results:

Test Case 1: B=32, T=4096, C=4096, HEAD_SIZE=64

| Implementation | Forward Time | Backward Time |
| --- | --- | --- |
| CUDA BF16 | 32.80 ms | 148.05 ms |
| FLA BF16 | 50.17 ms | 162.42 ms |

Test Case 2: B=8, T=4096, C=4096, HEAD_SIZE=64

| Implementation | Forward Time | Backward Time |
| --- | --- | --- |
| CUDA BF16 | 9.69 ms | 46.41 ms |
| FLA BF16 | 13.06 ms | 40.79 ms |

Where:

- B: Batch size
- T: Token length
- C: Hidden layer dimension
- HEAD_SIZE: Size of attention heads
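
The benchmark scripts that produced the numbers above are not shown in this PR description. As a rough illustration only, here is a minimal timing sketch using CUDA events; the import path `fla.ops.rwkv6.chunk_rwkv6`, the head-first `(B, H, T, HEAD_SIZE)` layout, and the `(q, k, v, w, u)` signature are assumptions about the library's API and should be checked against your installed version:

```python
# Hedged benchmark sketch (not the PR's actual script): times chunk_rwkv6
# forward and forward+backward in bf16 on CUDA.
import torch
from fla.ops.rwkv6 import chunk_rwkv6  # assumed import path

B, T, C, HEAD_SIZE = 8, 4096, 4096, 64
H = C // HEAD_SIZE  # number of heads

def bench(fn, warmup=3, iters=10):
    """Average wall time of fn() in ms, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

shape = (B, H, T, HEAD_SIZE)  # head-first layout: an assumption
q, k, v = (torch.randn(shape, device='cuda', dtype=torch.bfloat16,
                       requires_grad=True) for _ in range(3))
# negative log-space decay and per-head bonus: plausible inputs, not
# necessarily the distributions used in the PR's tests
w = torch.randn(shape, device='cuda',
                dtype=torch.bfloat16).sigmoid().log().requires_grad_()
u = torch.randn(H, HEAD_SIZE, device='cuda', dtype=torch.bfloat16,
                requires_grad=True)

def fwd():
    return chunk_rwkv6(q, k, v, w, u)

def fwd_bwd():
    o = chunk_rwkv6(q, k, v, w, u)
    o = o[0] if isinstance(o, tuple) else o  # some versions also return state
    o.sum().backward()

print(f"forward:          {bench(fwd):.2f} ms")
print(f"forward+backward: {bench(fwd_bwd):.2f} ms")
```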

Accuracy

We've measured the error ratios of various components against the FP32 CUDA implementation. Our ChunkRWKV6 FLA implementation achieves error levels consistent with the CUDA implementations:

```
y:  0.0020138283862787135
gr: 0.00250389610197927
gk: 0.002499128980485113
gv: 0.0028262425242107
gw: 0.0027358097395330894
gu: 0.001821853127644057
```
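
The comparison scripts are not reproduced here; below is a minimal sketch of how such relative-error ratios can be computed, assuming a float32 CUDA reference `ref` and a test output `out` (the exact metric used in the attached scripts may differ):

```python
import torch

def err_ratio(out: torch.Tensor, ref: torch.Tensor) -> float:
    """Ratio of the error norm to the reference norm, computed in fp32."""
    out, ref = out.float(), ref.float()
    return ((out - ref).norm() / ref.norm()).item()

# usage (hypothetical tensor names):
# err_ratio(y_fla_bf16, y_cuda_fp32)
```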

@uniartisan
Contributor Author

Please try to squash merge :)

@yzhangcs
Member

@uniartisan Hello, many thanks for these great contributions!
I will make some checks soon.
However, could you restrict the revisions to the RWKV6 chunk only? You've defined many decorators for other purposes that are unrelated to this PR's title. I think it would be better to create a separate PR for those changes.
Additionally, please note that there are some formatting errors that do not conform to PEP 8.

@yzhangcs
Member

Also, it is not recommended to strip the trailing spaces at the end of lines in the README file, as they are sometimes used as line breaks.

@uniartisan
Contributor Author

uniartisan commented Aug 14, 2024

Your suggestion makes a lot of sense. Some of these changes were introduced by the editor. I'll first limit the changes to ChunkRWKV6 and fix the tests.

@uniartisan force-pushed the enhance branch 5 times, most recently from fb728e7 to b99a7a1 (August 14, 2024 05:46)
@uniartisan
Contributor Author

checkrwkv6.tar.gz
Here is the code that compares CUDA with FLA.

@uniartisan
Contributor Author

Also, this pull request fixes #29.
The problem was introduced by bfloat16 when calculating dq and dk. By converting to float32 when necessary, using TF32 as much as possible, and reordering the group sequence, this pull request speeds things up and achieves the same accuracy as the CUDA implementation (which is pure fp32 internally).
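
As a hypothetical illustration of the two numerical tactics described above (not the PR's actual kernel code): upcast the bf16 operands to float32 before the gradient reductions, and let the remaining fp32 matmuls use TF32 tensor cores. The helper name `dq_dk_fp32` below is made up, and the math covers plain attention-style scores S = q kᵀ only, omitting RWKV6's decay and bonus terms:

```python
import torch

# Allow matmuls that run in float32 to use TF32 tensor cores for speed.
torch.backends.cuda.matmul.allow_tf32 = True

def dq_dk_fp32(do, q, k, v):
    """Accumulate dq/dk in float32 so the small bf16 mantissa does not
    dominate the gradient error, then cast back to the input dtype."""
    do32, q32, k32, v32 = (t.float() for t in (do, q, k, v))
    ds = do32 @ v32.transpose(-1, -2)    # dL/dS, computed in fp32
    dq = ds @ k32                        # dL/dq = dS @ k
    dk = ds.transpose(-1, -2) @ q32      # dL/dk = dS^T @ q
    return dq.to(q.dtype), dk.to(k.dtype)
```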

@yzhangcs
Member

@uniartisan Hi, I just made some reviews, could you have a check?

@uniartisan
Contributor Author

> @uniartisan Hi, I just made some reviews, could you have a check?

Hi, I can't see any comments. Could you tell me where I should check?

@yzhangcs
Member

@uniartisan Can you see the messages in your notification box?

@uniartisan
Contributor Author

[screenshot]
Could you give me a review like this? https://github.com/sustcsonglin/flash-linear-attention/pull/44/files/4a3e2bb1d699c7e41ead7adc2f2403fb3e79ceb6

I can't see your messages :(

@yzhangcs
Member

@uniartisan sure, sorry for my late reply

@yzhangcs
Member

@uniartisan Can you see my updated comments between the lines?

@uniartisan
Contributor Author

> @uniartisan Can you see my updated comments between the lines?

Sorry, I don't know what's going on. I still cannot see your review comments. Maybe you can directly post them here. 😎

@yzhangcs self-requested a review on August 25, 2024 17:22
@uniartisan force-pushed the enhance branch 2 times, most recently from 9926634 to 49a8951 (August 26, 2024 06:05)
@uniartisan
Contributor Author

@yzhangcs Hello,
I hope this finds you well. I have synchronized all the latest changes with your project. Given your expertise and valuable insights, could you kindly take some time to review these updates at your earliest convenience?
Your feedback is crucial to ensuring we're on the right track, and I greatly appreciate your assistance in this matter. :)

@yzhangcs
Member

@uniartisan Thank you for the update. I'm running your code locally, as there is no CI with GPUs. Will sync with you soon.

@yzhangcs
Member

@uniartisan Hi, can you grant me access to this branch so that I can make some updates?

@uniartisan
Contributor Author

> Hi, can you grant me access to this branch so that I can make some updates?

Of course!!! Sorry for my late reply. I will try it :)

@yzhangcs
Member

@uniartisan Hi, closing this PR as the new features are too tightly coupled. @sustcsonglin just pushed some new commits resolving the RWKV6 precision problems; check those out for more details. You can create new PRs if something could still be improved.

Again, thank you for your contributions and hard work!

@yzhangcs closed this on Sep 23, 2024
@uniartisan deleted the enhance branch on September 27, 2024 08:06