diff --git a/README.md b/README.md
index 91ac1d4134..61c0eba45b 100644
--- a/README.md
+++ b/README.md
@@ -115,6 +115,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>Llama2 (7B - 70B)</li>
 <li>Llama3 (8B, 70B)</li>
 <li>Llama3.1 (8B, 70B)</li>
+<li>Llama3.2 (1B, 3B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
diff --git a/README_ja.md b/README_ja.md
index ea4480e282..999ebc9f0b 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -114,6 +114,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
 <li>Llama2 (7B - 70B)</li>
 <li>Llama3 (8B, 70B)</li>
 <li>Llama3.1 (8B, 70B)</li>
+<li>Llama3.2 (1B, 3B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
diff --git a/README_zh-CN.md b/README_zh-CN.md
index cdddb64a22..f002899c60 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -116,6 +116,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>Llama2 (7B - 70B)</li>
 <li>Llama3 (8B, 70B)</li>
 <li>Llama3.1 (8B, 70B)</li>
+<li>Llama3.2 (1B, 3B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
diff --git a/docs/en/quantization/w4a16.md b/docs/en/quantization/w4a16.md
index 3a04cd7b05..32dfe18d80 100644
--- a/docs/en/quantization/w4a16.md
+++ b/docs/en/quantization/w4a16.md
@@ -69,7 +69,7 @@ lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server_name {ip_addr} --serve

 ## Evaluation

-Please refer to [OpenCompass](https://opencompass.readthedocs.io/en/latest/index.html) about model evaluation with LMDeploy.
+Please refer to [OpenCompass](https://opencompass.readthedocs.io/en/latest/index.html) for model evaluation with LMDeploy, following this [guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lmdeploy.html).

 ## Inference

diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
index c992f730c8..260120efe0 100644
--- a/docs/en/supported_models/supported_models.md
+++ b/docs/en/supported_models/supported_models.md
@@ -10,6 +10,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | Llama2      | 7B - 70B    | LLM  | Yes | Yes | Yes | Yes |
 | Llama3      | 8B, 70B     | LLM  | Yes | Yes | Yes | Yes |
 | Llama3.1    | 8B, 70B     | LLM  | Yes | Yes | Yes | Yes |
+| Llama3.2    | 3B          | LLM  | Yes | Yes | Yes | Yes |
 | InternLM    | 7B - 20B    | LLM  | Yes | Yes | Yes | Yes |
 | InternLM2   | 7B - 20B    | LLM  | Yes | Yes | Yes | Yes |
 | InternLM2.5 | 7B          | LLM  | Yes | Yes | Yes | Yes |
@@ -20,7 +21,6 @@
 | Qwen2       | 1.5B - 72B  | LLM  | Yes | Yes | Yes | Yes |
 | Mistral     | 7B          | LLM  | Yes | Yes | Yes | -   |
 | Qwen-VL     | 7B          | MLLM | Yes | Yes | Yes | Yes |
-| Qwen2-VL    | 2B, 7B, 72B | MLLM | Yes | Yes | Yes | -   |
 | DeepSeek-VL | 7B          | MLLM | Yes | Yes | Yes | Yes |
 | Baichuan    | 7B          | LLM  | Yes | Yes | Yes | Yes |
 | Baichuan2   | 7B          | LLM  | Yes | Yes | Yes | Yes |
@@ -49,6 +49,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | Llama2      | 7B - 70B | LLM  | Yes | Yes | Yes | Yes | Yes |
 | Llama3      | 8B, 70B  | LLM  | Yes | Yes | Yes | Yes | Yes |
 | Llama3.1    | 8B, 70B  | LLM  | Yes | Yes | Yes | No  | -   |
+| Llama3.2    | 1B, 3B   | LLM  | Yes | Yes | Yes | No  | -   |
 | Llama3.2-VL | 8B, 90B  | MLLM | Yes | Yes | Yes | No  | -   |
 | InternLM    | 7B - 20B | LLM  | Yes | Yes | Yes | Yes | -   |
 | InternLM2   | 7B - 20B | LLM  | Yes | Yes | Yes | Yes | Yes |
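Note for reviewers: the TurboMind table above gains only the 3B entry, while the PyTorch-engine table gains both 1B and 3B. The reason is the head_dim guard added in `lmdeploy/turbomind/supported_models.py` at the end of this diff: TurboMind kernels only support head_dim = 128. A quick sanity check of the arithmetic; the hidden sizes and head counts below are taken from the public Hugging Face configs and should be treated as assumptions, not part of this patch:

```python
# Head-dim arithmetic behind the table split. The config numbers are
# assumed from the Hugging Face configs of Llama-3.2-1B and Llama-3.2-3B.
configs = {
    'Llama-3.2-1B': {'hidden_size': 2048, 'num_attention_heads': 32},
    'Llama-3.2-3B': {'hidden_size': 3072, 'num_attention_heads': 24},
}

for name, cfg in configs.items():
    head_dim = cfg['hidden_size'] // cfg['num_attention_heads']
    engine = 'turbomind' if head_dim == 128 else 'pytorch engine only'
    print(f'{name}: head_dim={head_dim} -> {engine}')

# Llama-3.2-1B: head_dim=64  -> pytorch engine only
# Llama-3.2-3B: head_dim=128 -> turbomind
```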
diff --git a/docs/zh_cn/quantization/w4a16.md b/docs/zh_cn/quantization/w4a16.md
index b61b894781..d50e464af3 100644
--- a/docs/zh_cn/quantization/w4a16.md
+++ b/docs/zh_cn/quantization/w4a16.md
@@ -72,7 +72,7 @@ lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server-name {ip_addr} --serve

 ## 模型评测

-我们使用 [OpenCompass](https://opencompass.readthedocs.io/zh-cn/latest/index.html) 评测量化模型在各个维度上的能力
+我们使用 [OpenCompass](https://opencompass.readthedocs.io/zh-cn/latest/index.html) 评测量化模型在各个维度上的能力。方法请参考[此处](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/evaluation_lmdeploy.html)

 ## 模型推理

diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
index 695103b52e..26930cf3ce 100644
--- a/docs/zh_cn/supported_models/supported_models.md
+++ b/docs/zh_cn/supported_models/supported_models.md
@@ -10,6 +10,7 @@
 | Llama2      | 7B - 70B    | LLM  | Yes | Yes | Yes | Yes |
 | Llama3      | 8B, 70B     | LLM  | Yes | Yes | Yes | Yes |
 | Llama3.1    | 8B, 70B     | LLM  | Yes | Yes | Yes | Yes |
+| Llama3.2    | 3B          | LLM  | Yes | Yes | Yes | Yes |
 | InternLM    | 7B - 20B    | LLM  | Yes | Yes | Yes | Yes |
 | InternLM2   | 7B - 20B    | LLM  | Yes | Yes | Yes | Yes |
 | InternLM2.5 | 7B          | LLM  | Yes | Yes | Yes | Yes |
@@ -20,7 +21,6 @@
 | Qwen2       | 1.5B - 72B  | LLM  | Yes | Yes | Yes | Yes |
 | Mistral     | 7B          | LLM  | Yes | Yes | Yes | -   |
 | Qwen-VL     | 7B          | MLLM | Yes | Yes | Yes | Yes |
-| Qwen2-VL    | 2B, 7B, 72B | MLLM | Yes | Yes | Yes | -   |
 | DeepSeek-VL | 7B          | MLLM | Yes | Yes | Yes | Yes |
 | Baichuan    | 7B          | LLM  | Yes | Yes | Yes | Yes |
 | Baichuan2   | 7B          | LLM  | Yes | Yes | Yes | Yes |
@@ -49,6 +49,7 @@ turbomind 引擎不支持 window attention。所以,对于应用了 window att
 | Llama2      | 7B - 70B | LLM  | Yes | Yes | Yes | Yes | Yes |
 | Llama3      | 8B, 70B  | LLM  | Yes | Yes | Yes | Yes | Yes |
 | Llama3.1    | 8B, 70B  | LLM  | Yes | Yes | Yes | No  | -   |
+| Llama3.2    | 1B, 3B   | LLM  | Yes | Yes | Yes | No  | -   |
 | Llama3.2-VL | 8B, 90B  | MLLM | Yes | Yes | Yes | No  | -   |
 | InternLM    | 7B - 20B | LLM  | Yes | Yes | Yes | Yes | -   |
 | InternLM2   | 7B - 20B | LLM  | Yes | Yes | Yes | Yes | Yes |

diff --git a/lmdeploy/model.py b/lmdeploy/model.py
index f251ca18d2..26ab856bc2 100644
--- a/lmdeploy/model.py
+++ b/lmdeploy/model.py
@@ -772,7 +772,7 @@ def match(cls, model_path: str) -> Optional[str]:
         return 'llama3'


-@MODELS.register_module(name='llama3_1')
+@MODELS.register_module(name=['llama3_1', 'llama3_2'])
 class Llama3_1(Llama3):
     """Chat template of LLaMA3.1 model."""
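The `lmdeploy/model.py` change reuses the existing Llama3.1 chat template for Llama3.2 by registering `Llama3_1` under a second name instead of adding a new class. A minimal sketch of the effect, assuming `MODELS` is the mmengine-style registry that the `@MODELS.register_module` decorator suggests:

```python
# Minimal sketch, assuming lmdeploy.model.MODELS is an mmengine-style
# Registry (as the decorator above suggests).
from lmdeploy.model import MODELS

# After this patch both names resolve to the same template class, so
# llama3.2 checkpoints need no dedicated chat template.
assert MODELS.get('llama3_2') is MODELS.get('llama3_1')
print(MODELS.get('llama3_2').__name__)  # Llama3_1
```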
diff --git a/lmdeploy/turbomind/supported_models.py b/lmdeploy/turbomind/supported_models.py
index 8ebb93fdf2..bdf129b019 100644
--- a/lmdeploy/turbomind/supported_models.py
+++ b/lmdeploy/turbomind/supported_models.py
@@ -84,9 +84,9 @@ def _is_head_dim_128(cfg):
             if num_attn_head == 40:
                 # baichuan-13B, baichuan2-13B not supported by turbomind
                 support_by_turbomind = False
-        elif arch == 'Qwen2ForCausalLM':
-            # qwen2 0.5b size_per_head is 64, which hasn't been supported
-            # by turbomind yet
+        elif arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
+            # the head_dim of qwen2-0.5b and llama3.2-1b is 64, which
+            # turbomind doesn't support yet
             support_by_turbomind = _is_head_dim_128(cfg)
         elif arch in ('ChatGLMModel', 'ChatGLMForConditionalGeneration'):
             # chatglm1/2/3 is not working yet
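With the guard extended to `LlamaForCausalLM`, a Llama checkpoint whose head_dim is not 128 (such as Llama3.2-1B) is routed away from TurboMind rather than failing at load time. One way to observe the routing, assuming the `autoget_backend` helper in `lmdeploy.archs` (present in recent releases; treat the import path as an assumption):

```python
# Hedged sketch: autoget_backend is assumed to return 'turbomind' when
# the TurboMind support check passes and 'pytorch' otherwise.
from lmdeploy.archs import autoget_backend

for model in ('meta-llama/Llama-3.2-1B-Instruct',
              'meta-llama/Llama-3.2-3B-Instruct'):
    print(model, '->', autoget_backend(model))

# Expected: the 1B model (head_dim 64) falls back to 'pytorch', while
# the 3B model (head_dim 128) stays on 'turbomind'.
```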