How to specify using flash-attn instead of flash-attn2 (GPU is V100) #2784
Unanswered · huangwei907781034 asked this question in Q&A · Replies: 1 comment
- You can't, unless you change the model repo's code: https://huggingface.co/OpenGVLab/InternVL2-2B-AWQ/blob/99715a396095f885f96243ffd9796e35fa3a679d/modeling_internlm2.py#L79. For the vision model inference, lmdeploy does not rewrite that code; instead, it calls the model repo's own functions to compute the vision embedding.
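To make the reply concrete, here is a minimal sketch of the kind of change it refers to: guard the flash-attn path in the model repo's attention code and fall back to PyTorch's built-in scaled_dot_product_attention when FlashAttention 2 cannot run (it requires compute capability 8.0 or newer, while a V100 is 7.0). The helper names and structure below are illustrative assumptions, not the actual contents of modeling_internlm2.py.

```python
# Illustrative sketch only -- not the actual modeling_internlm2.py code.
# Idea: pick an attention path at import time based on what the GPU supports.
import torch
import torch.nn.functional as F


def _flash_attn2_usable() -> bool:
    """FlashAttention 2 needs an Ampere-or-newer GPU (compute capability >= 8.0)."""
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    if major < 8:  # V100 is sm70, so this returns False there
        return False
    try:
        import flash_attn  # noqa: F401
        return True
    except ImportError:
        return False


USE_FLASH_ATTN2 = _flash_attn2_usable()


def attention(q, k, v, causal: bool = False):
    """q, k, v: (batch, heads, seq, head_dim)."""
    if USE_FLASH_ATTN2:
        from flash_attn import flash_attn_func
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    # Fallback that works on V100: PyTorch's fused SDPA kernel.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```

Since lmdeploy simply imports the repo's vision/embedding code, a change like this would have to live in the model repo (or a local copy of it) rather than in lmdeploy's own options.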
- Environment and log from the original question:
nvidia-smi:
flash-attn:
log:
work@codeserver-a6930cd6-204c-4627-b625-c62d8be57437-584f7cdb8csbcpz:~/llm_service$ lmdeploy serve api_server OpenGVLab/InternVL2-2B-AWQ --backend turbomind --cache-max-entry-count 0.2 --vision-max-batch-size 256 --server-port 23333 --model-format awq
Fetching 24 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 207982.02it/s]
/usr/local/lib/python3.10/dist-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/usr/local/lib/python3.10/dist-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
FlashAttention2 is not installed.
InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
Warning: Flash attention is not available, using eager attention instead.
[WARNING] gemm_config.in is not found; using default GEMM algo
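As a quick sanity check of the log above, one can confirm why the FlashAttention 2 path is skipped on this machine. This is a generic diagnostic snippet, not an lmdeploy or model-repo API:

```python
# Diagnostic sketch: verify GPU compute capability and flash-attn availability.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")  # a V100 prints 7.0

try:
    import flash_attn
    print("flash_attn version:", flash_attn.__version__)
except ImportError as exc:
    print("flash_attn not importable:", exc)

# FlashAttention 2 requires compute capability >= 8.0 (Ampere or newer),
# so on a V100 the model code falls back to eager attention, as the log shows.
print("FA2-capable GPU:", major >= 8)
```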