How to specify using flash-attn instead of flash-attn2 (GPU is V100) #2784
Unanswered · huangwei907781034 asked this question in Q&A · Replies: 1 comment
- You can't, unless you change the model repo's code: https://huggingface.co/OpenGVLab/InternVL2-2B-AWQ/blob/99715a396095f885f96243ffd9796e35fa3a679d/modeling_internlm2.py#L79. For the vision model inference, lmdeploy does not rewrite that code; instead, it calls the model repo's own functions to compute the vision embedding.
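To make the reply concrete, here is a minimal sketch of the kind of change it refers to: guard the flash-attn path in the model repo's attention code and fall back to PyTorch's built-in scaled_dot_product_attention when FlashAttention 2 cannot run (it requires compute capability 8.0 or newer, while a V100 is 7.0). The helper names and structure below are illustrative assumptions, not the actual contents of modeling_internlm2.py.

```python
# Illustrative sketch only -- not the actual modeling_internlm2.py code.
# Idea: pick an attention path at import time based on what the GPU supports.
import torch
import torch.nn.functional as F


def _flash_attn2_usable() -> bool:
    """FlashAttention 2 needs an Ampere-or-newer GPU (compute capability >= 8.0)."""
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    if major < 8:  # V100 is sm70, so this returns False there
        return False
    try:
        import flash_attn  # noqa: F401
        return True
    except ImportError:
        return False


USE_FLASH_ATTN2 = _flash_attn2_usable()


def attention(q, k, v, causal: bool = False):
    """q, k, v: (batch, heads, seq, head_dim)."""
    if USE_FLASH_ATTN2:
        from flash_attn import flash_attn_func
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    # Fallback that works on V100: PyTorch's fused SDPA kernel.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```

Since lmdeploy simply imports the repo's vision/embedding code, a change like this would have to live in the model repo (or a local copy of it) rather than in lmdeploy's own options.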
- Environment and log from the original question:
nvidia-smi:
flash-attn:
log:
work@codeserver-a6930cd6-204c-4627-b625-c62d8be57437-584f7cdb8csbcpz:~/llm_service$ lmdeploy serve api_server OpenGVLab/InternVL2-2B-AWQ --backend turbomind --cache-max-entry-count 0.2 --vision-max-batch-size 256 --server-port 23333 --model-format awq
Fetching 24 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 207982.02it/s]
/usr/local/lib/python3.10/dist-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/usr/local/lib/python3.10/dist-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
FlashAttention2 is not installed.
InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
Warning: Flash attention is not available, using eager attention instead.
[WARNING] gemm_config.in is not found; using default GEMM algo
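As a quick sanity check of the log above, one can confirm why the FlashAttention 2 path is skipped on this machine. This is a generic diagnostic snippet, not an lmdeploy or model-repo API:

```python
# Diagnostic sketch: verify GPU compute capability and flash-attn availability.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")  # a V100 prints 7.0

try:
    import flash_attn
    print("flash_attn version:", flash_attn.__version__)
except ImportError as exc:
    print("flash_attn not importable:", exc)

# FlashAttention 2 requires compute capability >= 8.0 (Ampere or newer),
# so on a V100 the model code falls back to eager attention, as the log shows.
print("FA2-capable GPU:", major >= 8)
```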