Torch compiled FLCE is 2x faster than the current FLCE #227

Open
ByronHsu opened this issue Sep 7, 2024 · 12 comments

Comments

@ByronHsu
Collaborator

ByronHsu commented Sep 7, 2024

🚀 The feature, motivation and pitch

We can leverage torch.compile to fuse the operations we currently cannot fuse, such as the upcast, the contiguous call, etc.

[benchmark screenshot]

Sample code: https://gist.github.com/Chillee/22cd93e11b887db1f596ab754d60a899#file-lce_benchmark-py
Provided by the brilliant @Chillee
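For context, here is a minimal sketch of the chunked approach (the function below is illustrative, not the gist's exact code, and assumes the usual -100 ignore index): compute the logits chunk by chunk along the token dimension and let torch.compile fuse the upcast and the loss math, so the full float32 [B*T, V] logits tensor is never materialized at once.

```python
import torch
import torch.nn.functional as F


def chunked_linear_cross_entropy(hidden, weight, labels, chunk_size=1024):
    # hidden: [N, H] (bf16/fp16), weight: [V, H], labels: [N] with -100 marking ignored tokens
    total_loss = hidden.new_zeros((), dtype=torch.float32)
    n_valid = (labels != -100).sum()
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = h @ weight.T  # per-chunk logits, [chunk, V]
        total_loss = total_loss + F.cross_entropy(
            logits.float(), y, ignore_index=-100, reduction="sum"
        )
    return total_loss / n_valid


compiled_flce = torch.compile(chunked_linear_cross_entropy)
```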

Alternatives

No response

Additional context

No response

@wizyoung
Contributor

Actually, if we align the CHUNK_SIZE of the Torch-compiled FLCE with the strategy used in Liger's FLCE, the compiled version is only slightly faster than the Liger version, though it also requires a bit more memory. The advantage of the Torch-compiled version is its flexibility: implementing the Gemma2 softcap logits is very straightforward, whereas I struggled for some time to achieve consistent accuracy with this in Liger.
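To illustrate the flexibility point, softcapping is one extra line on the per-chunk logits in the compiled version (a hedged sketch; the function name and defaults are illustrative, not Liger's or the gist's code, and 30.0 is Gemma2's published final-logit softcap):

```python
import torch
import torch.nn.functional as F


def chunk_loss_with_softcap(h, weight, y, softcap=30.0):
    # h: [chunk, H], weight: [V, H], y: [chunk]
    logits = h @ weight.T
    if softcap is not None:
        logits = softcap * torch.tanh(logits / softcap)  # Gemma2-style logit softcapping
    return F.cross_entropy(logits.float(), y, ignore_index=-100, reduction="sum")
```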

@wizyoung
Contributor

After setting CHUNK_SIZE and running the benchmark script:
[benchmark screenshots]

@Chillee

Chillee commented Sep 10, 2024

@wizyoung how are you setting the chunk size? I wasn't able to get the liger kernel to perform much better even when changing the chunk size.

@wizyoung
Contributor

wizyoung commented Sep 10, 2024

@Chillee By referencing https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py#L23. I mean changing the chunk size in the torch-compiled FLCE: your default chunk size is 1024, and I changed it to 256. Then I get:
[benchmark screenshot]
Keeping only the benchmark tests for Liger and the compiled chunked CE:
[benchmark screenshots]

Env: torch 2.3.1, triton 2.3.1, A100 80G, CUDA 12.3
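For reference, the chunk-size strategy in the linked Liger file looks roughly like this (paraphrased from memory, so treat the exact formula as an approximation and check the linked source):

```python
import math


def liger_like_chunk_size(BT, H, V):
    # Scale the number of chunks with how much larger the vocab is than the
    # hidden size, so the materialized per-chunk logits stay roughly bounded.
    inc_factor = math.ceil(V / H)  # e.g. ceil(128256 / 4096) = 32
    chunk_size = 2 ** math.ceil(math.log2(math.ceil(BT / inc_factor)))  # next power of 2
    num_chunks = math.ceil(BT / chunk_size)
    return chunk_size, num_chunks
```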

@wizyoung
Contributor

I have done some quick tests with different B, T, D, and V to mimic my training conditions (Llama3 and Gemma2) in my environment. My conclusion is that the torch-compiled FLCE is indeed faster, but has worse memory management.

@wizyoung
Contributor

https://gist.github.com/wizyoung/5330ad501e73a97dfe2f0088decdb1ca
I have implemented a version of the torch.compile chunked_lce that supports soft caps and passes all numerical accuracy tests in benchmark_fused_linear_cross_entropy.py, modified from this repo. My main concern is the frequent changes in input shape, which result in varying chunk sizes. To mitigate this overhead, I used torch.compile(dynamic=True, options={"shape_padding": True}); however, I am still uncertain about its effectiveness and will look into it during actual training.
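Spelled out, the compile call described above looks like this (a sketch; chunked_lce is a stand-in for the chunked loss function in the gist):

```python
import torch
import torch.nn.functional as F


def chunked_lce(hidden, weight, labels):
    # stand-in for the chunked linear-cross-entropy function from the gist
    return F.cross_entropy((hidden @ weight.T).float(), labels, ignore_index=-100)


compiled_flce = torch.compile(
    chunked_lce,
    dynamic=True,                     # mark shapes symbolic to avoid per-shape recompiles
    options={"shape_padding": True},  # inductor option mentioned above
)
```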

@Chillee

Chillee commented Sep 10, 2024

@wizyoung I agree there's some additional memory overhead (in particular, I think we don't inplace the addmm), but the additional memory is generally pretty negligible here, no?

For example, if I change the chunk size from 256 to 512, torch.compile performance improves from 186 ms down to 153 ms, while memory only increases from 1.48 GB to 1.54 GB.

If I try increasing the chunk size of Liger, it doesn't seem to increase the performance as much as the torch.compile version

@ekojsalim

Curious how this compares with JonasGeiping/linear_cross_entropy_loss, though torch.compile seems good enough.

@Chillee

Chillee commented Sep 11, 2024

@ekojsalim In my brief testing, it seems like it's both faster and uses less memory.

@wizyoung
Contributor

> @wizyoung I agree there's some additional memory overhead (in particular, I think we don't inplace the addmm), but the additional memory is generally pretty negligible here, no?
>
> For example, if I change the chunk size from 256 to 512, torch.compile performance improves from 186 ms down to 153 ms, while memory only increases from 1.48 GB to 1.54 GB.
>
> If I try increasing the chunk size of Liger, it doesn't seem to increase the performance as much as the torch.compile version

Yes, the increase in memory usage is generally negligible. My primary concern is the runtime overhead: B*T varies significantly and is not a multiple of the chunk size, leading to frequent recompiles (I added TORCH_LOGS="recompiles" to confirm this). Therefore, I use torch.compile(dynamic=True, options={"shape_padding": True}) as documented; however, I am uncertain about its actual effectiveness.
I ran an expensive benchmark test using the script from this repo, setting BT = [2**12] + (np.random.randint(2**12, 2**15 + 1, 80) + np.random.randint(0, 1001, 80)).tolist() + [2**15] and keeping H=4096 and V=128256, with chunk_size = 1024.
[benchmark screenshots]
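For anyone reproducing the sweep, the shape setup described above can be written out like this (a sketch; H, V, the chunk size, and the BT expression come from the comment, the rest is assumed):

```python
import numpy as np

H, V, CHUNK_SIZE = 4096, 128256, 1024
BT = (
    [2**12]
    + (np.random.randint(2**12, 2**15 + 1, 80) + np.random.randint(0, 1001, 80)).tolist()
    + [2**15]
)
# Run the benchmark with TORCH_LOGS="recompiles" set in the environment to see
# when the varying B*T values trigger recompilation of the compiled FLCE.
```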

@Chillee

Chillee commented Sep 11, 2024

@wizyoung Can you post your benchmark script?

@wizyoung
Contributor

@Chillee I have updated my scripts here: https://gist.github.com/wizyoung/5330ad501e73a97dfe2f0088decdb1ca

ByronHsu pushed a commit that referenced this issue Nov 14, 2024
## Summary
Adds chunked ORPO loss kernel 

## Testing Done
Benchmarks
![Speed ORPO](https://github.com/user-attachments/assets/ae9e6f67-14cd-4189-9d64-9a2f94a3b3c6)
![Mem ORPO](https://github.com/user-attachments/assets/47c289f4-2876-4530-949c-2c2825bc0f79)

References:
1. #227 
2. https://gist.github.com/Chillee/22cd93e11b887db1f596ab754d60a899#file-lce_benchmark-py

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: shisahni_LinkedIn <shisahni@linkedin.com>