Accelerating Chinese-CLIP with FlashAttention

Chinese-CLIP now supports FlashAttention to speed up training.

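Environment Preparation

An Nvidia GPU with the Turing, Ampere, Ada, or Hopper architecture (e.g. H100, A100, RTX 3090, T4, RTX 2080). For the GPU models that belong to each Nvidia architecture, see the table in this document.
CUDA 11.4 or later.
Pytorch 1.12 or later.
FlashAttention: install it by running pip install flash-attn.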
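
The requirements above can be checked programmatically. The following is a minimal sketch, not part of Chinese-CLIP: the function names are hypothetical, and mapping "Turing or newer" to CUDA compute capability 7.5 or higher is our reading of the architecture list (Turing cards such as T4 and RTX 2080 report 7.5). The final probe assumes Pytorch is installed and is skipped otherwise.

```python
import re


def parse_version(text):
    """Return (major, minor) from strings such as '1.12.1+cu113' or '11.4'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(compute_capability, cuda_version, torch_version):
    """Return a list of unmet requirements (empty if all are met).

    Thresholds follow the list above. 7.5 is the compute capability of
    Turing, the oldest architecture listed (an assumption of this sketch).
    """
    problems = []
    if tuple(compute_capability) < (7, 5):
        problems.append("GPU architecture older than Turing")
    if parse_version(cuda_version) < (11, 4):
        problems.append("CUDA older than 11.4")
    if parse_version(torch_version) < (1, 12):
        problems.append("Pytorch older than 1.12")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("Pytorch is not installed")
    else:
        if torch.cuda.is_available() and torch.version.cuda is not None:
            capability = torch.cuda.get_device_capability(0)
            problems = check_environment(
                capability, torch.version.cuda, torch.__version__
            )
            print(problems or "environment looks compatible")
        else:
            print("no CUDA device visible")
```

An empty list means every listed requirement is met. The check does not verify that flash-attn itself is installed.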

Using It in Chinese-CLIP!

Applying FlashAttention to Chinese-CLIP finetuning is straightforward: add the --use-flash-attention option to the finetune sh script. We provide a sample script, run_scripts/muge_finetune_vit-b-16_rbt-base_flashattn.sh.
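
As an illustration of the step above, the sketch below appends the option to a command line built in Python. Only --use-flash-attention comes from this document. The helper name and the base command are hypothetical placeholders, so substitute the actual command from your finetune script.

```python
FLASH_ATTENTION_FLAG = "--use-flash-attention"  # option named in the text above


def enable_flash_attention(command):
    """Return a copy of an argument list with the FlashAttention option added once."""
    command = list(command)
    if FLASH_ATTENTION_FLAG not in command:
        command.append(FLASH_ATTENTION_FLAG)
    return command


# Hypothetical placeholder command; the real one lives in your finetune sh script.
base_command = ["python3", "train.py", "--batch-size=128"]
print(" ".join(enable_flash_attention(base_command)))
```

The helper is idempotent, so calling it on a command that already carries the option leaves it unchanged.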

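Training Speed and Memory Usage Comparison

Enabling FlashAttention significantly speeds up Chinese-CLIP finetuning and reduces GPU memory usage without affecting model quality. Our experiments were run on a machine with 8 A100 GPUs (80 GB memory), using FlashAttention 0.2.8 and Pytorch 1.10.1.

The tables below compare batch time and GPU memory usage for FP16 finetuning of each model scale, at the same batch size, with and without FlashAttention. With FlashAttention enabled, training is faster and uses less memory. The gains are generally larger for the larger models.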

Batch Time (seconds/it)
Model	Batch size	w/o FlashAttention	w/ FlashAttention	Speedup
CN-CLIP_RN50	1200*8	1.710	1.680	1.02×
CN-CLIP_ViT-B/16	450*8	1.477	0.960	1.54×
CN-CLIP_ViT-L/14	128*8	1.293	0.785	1.65×
CN-CLIP_ViT-L/14@336px	40*8	1.397	0.587	2.38×
CN-CLIP_ViT-H/14	64*8	1.265	0.845	1.50×

GPU Memory (GB)
Model	Batch size	w/o FlashAttention	w/ FlashAttention
CN-CLIP_RN50	1200*8	79	75
CN-CLIP_ViT-B/16	450*8	80	56
CN-CLIP_ViT-L/14	128*8	77	50
CN-CLIP_ViT-L/14@336px	40*8	78	37
CN-CLIP_ViT-H/14	64*8	76	57
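
To make the comparison concrete, the sketch below recomputes the speedup column and derives the relative memory saving from the figures in the two tables above. The numbers are copied from the tables; the variable and function names are our own.

```python
# Per model: (batch time without, with FlashAttention) in seconds/it, then
# (GPU memory without, with FlashAttention) in GB, copied from the tables above.
RESULTS = {
    "CN-CLIP_RN50": ((1.710, 1.680), (79, 75)),
    "CN-CLIP_ViT-B/16": ((1.477, 0.960), (80, 56)),
    "CN-CLIP_ViT-L/14": ((1.293, 0.785), (77, 50)),
    "CN-CLIP_ViT-L/14@336px": ((1.397, 0.587), (78, 37)),
    "CN-CLIP_ViT-H/14": ((1.265, 0.845), (76, 57)),
}


def speedup(model):
    """Batch time without FlashAttention divided by batch time with it."""
    without, with_fa = RESULTS[model][0]
    return without / with_fa


def memory_saving(model):
    """Fraction of GPU memory saved by enabling FlashAttention."""
    without, with_fa = RESULTS[model][1]
    return (without - with_fa) / without


for model in RESULTS:
    print(f"{model}: {speedup(model):.2f}x faster, "
          f"{memory_saving(model):.0%} less memory")
```

For ViT-L/14@336px this gives a 2.38× speedup and roughly 53% less memory, the largest gain among the five models, while RN50 sees the smallest change.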
